Recent years have witnessed an explosive growth of data available on the Internet. As the amount of data has grown, so has the need to be able to locate relevant data and rank the data according to its relevance. Ranking is a key issue in many applications, such as information retrieval applications which retrieve data, such as documents, in response to a query. Ranking can provide an indication of whether retrieved documents may be relevant to the query or include information sought in the query.
One approach to determining the relevance of data and ranking the data is to use machine learning techniques. Machine learning techniques may use sets of training data to learn relevance and ranking functions. A common assumption, however, is that the relevance labels of training data (e.g., training documents) are reliable. In many cases, this is not so. For example, when multiple human annotators are tasked to label the same document for its relevance to a query, there are often annotators who disagree with the majority. This indicates a likelihood that training data that is annotated by a single annotator (which is common in practice) will contain noise (i.e., some discrepancy as compared with the majority judgment of multiple annotators).
This is understandable when considering the generally short and ambiguous nature of most queries, and the amount of information in documents (e.g., web pages, etc.), relative to different aspects of a query. Without knowing the intent of a query, for example, it can be difficult to know which aspects of the query are the most important. Further, relevance judgments can be more subjective than objective, since they are often dependent on the annotator's own perspective.
Using traditional learning techniques with noisy training data may produce low-quality ranking models.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. The term “techniques,” for instance, may refer to device(s), system(s), method(s) and/or computer-readable instructions throughout the document.
In one aspect, the application describes automatically determining a relevance of an object, such as, for example, a document, to a query, using a graphical model. In some embodiments, the graphical model shows relationships between an observed label for the object, the actual (i.e., true) label for the object, features of the object, and weights of the features. The relationships may be modeled using one or more observed and/or hidden modeling parameters.
The determining may include receiving a set of training data for a machine learning technique that may contain noise. At least one modeling parameter for the graphical model is learned by maximizing a log likelihood of the training data. Noise in the training data and a ranking function are modeled using the graphical model, based on the at least one modeling parameter. The relevance of the document may be determined using input from the graphical model, and outputted. In one embodiment, an output includes relevance data arranged by rank.
In alternate embodiments, iterative techniques such as regression may be employed to learn one or more modeling parameters.
The Detailed Description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
Various techniques for determining the relevance of an object using a noise tolerant ranking model are disclosed. For ease of discussion, the disclosure describes the various techniques with respect to a document, for example, a document resulting from a query. However, the descriptions also may be applicable to determining the relevance to a query or other input of other objects such as a video, an audio file, another media file, a data file, a text file, and the like.
In one embodiment, techniques are employed to automatically determine the relevance of a document (e.g., a web page, a text document, etc.) to a query, for example, a search engine query. For example, a user may initiate a web-search based on the query “machine learning.” In this case, the techniques discussed herein determine the relevance of documents that are returned by the search engine in response to the query. Various web sites, web pages, portable documents, media files, data files, and the like may be returned, with the relevance determined for each returned object. Additionally, in some embodiments, the returned documents may be returned in the order of their ranked relevance. In alternate embodiments, techniques may be employed to present other outputs (e.g., a database of the results, one or more annotated tables, customized reports, etc.) to a user.
Various techniques for determining a relevance of an object are disclosed. The discussion herein includes several sections. Each section is intended to be non-limiting. More particularly, this entire description is intended to illustrate components which may be utilized in determining the relevance of an object, but not components which are necessarily required. An overview of a system or technique for determining a relevance of an object is given with reference to FIG. 1.
In general, techniques are disclosed for determining the relevance of an object, based on learning to rank from (assumed) noisy data, using a noise tolerant graphical model. In one embodiment, the noise tolerant graphical model is a probabilistic model. The use of a probabilistic graphical model may provide a number of advantages.
In one embodiment, the system 102 may be connected to a network 110, and may receive the objects 106 from locations on the network 110. In the example of FIG. 1, the objects 106 are shown as objects 106A-106N.
Example systems for determining the relevance 104 of an object 106, for example, to a query 108 are discussed with reference to FIG. 1.
All or portions of the subject matter of this disclosure, including the modeling component 112, the analysis component 114 and/or the output component 116 (as well as other components, if present) can be implemented as a system, method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware or any combination thereof to control a computer or processor to implement the disclosure. For example, an example system 102 may be implemented using any form of computer-readable media (shown as memory 120 in FIG. 1).
Computer-readable storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Memory 120 is an example of computer-readable storage media. Additional types of computer-readable storage media that may be present include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which may be used to store the desired information and which may be accessed by the processor 118.
In contrast, communication media typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transport mechanism.
While the subject matter has been described above in the general context of computer-executable instructions of a computer program that runs on a computer and/or computers, those skilled in the art will recognize that the subject matter also may be implemented in combination with other program modules. Generally, program modules include routines, programs, components, data structures, and the like, which perform particular tasks and/or implement particular abstract data types.
Moreover, those skilled in the art will appreciate that the innovative techniques can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, mini-computing devices, mainframe computers, as well as personal computers, hand-held computing devices (e.g., personal digital assistant (PDA), phone, watch . . . ), microprocessor-based or programmable consumer or industrial electronics, and the like. The illustrated aspects may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. However, some, if not all aspects of the disclosure can be practiced on stand-alone computers. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
In one example embodiment, as illustrated in FIG. 1, the system 102 may include a modeling component 112, an analysis component 114, and/or an output component 116.
If included, the analysis component 114 (as shown in FIG. 1) may be configured to analyze the object 106 with respect to the query 108, using the graphical model 200.
In an implementation, the analysis component 114 is configured to associate the object 106 with at least two random variables to determine the relevance of the object 106: a hidden variable representing the actual label and an observable variable representing the observed label. In an example, the hidden and observable variables are modeling parameters of the graphical model 200. The hidden and observable parameters in these and other examples are discussed further in a later section.
If included, the output component 116 (as shown in FIG. 1) may be configured to output the determined relevance 104 of the object 106 to a user.
In various embodiments, the relevance 104 of the object 106 may be indicated by a prioritized or ranked list. In one example of the prioritized or ranked list, the output component 116 may output the relevance 104 of an object (for example 106A) with respect to another object (for example 106B) and output the relevance 104 of the objects in an arrangement according to their respective rankings. This provides an indication of the relative relevance of objects 106A and 106B. In other examples, the prioritized or ranked list may contain any number of objects and their relative relevance to a query 108. Additionally or alternatively, the relevance 104 of the object 106 may be presented in the form of a general or detailed analysis, and the like.
In one embodiment, the output of the system 102 is displayed on a display device (not shown). In alternate embodiments, the display device may be any device for displaying information to a user (e.g., computer monitor, mobile communications device, personal digital assistant (PDA), electronic pad or tablet computing device, projection device, imaging device, and the like). For example, the relevance 104 may be displayed on a user's mobile telephone display (in the case of a query performed from a mobile browser, for example). In alternate embodiments, the output may be provided to the user by another method (e.g., email, posting to a website, posting on a social network page, text message, etc.).
An example graphical model 200 is shown in the illustration of FIG. 2.
In one example embodiment, the graphical model 200 may include parameters (e.g., variables, vectors, quantities, etc.), rules (e.g., equations, conditions, constraints, etc.), relationships, probabilities, and the like, arranged to assist in determining a relevance of an object 106 to a query 108. In various embodiments, this may include determining the actual relevance label of the object 106.
Example elements of the graphical model 200 include the actual label y, which represents the actual relevance label of the object 106. The actual label y is initially hidden, since it is not readily observable or apparent, but it may be determined by the techniques described herein. The observed label of the object 106 is represented by ỹ. The observed label ỹ is a label that has been annotated to the object 106. In an embodiment, the observed label ỹ is an initially proposed relevance label for the object 106, indicating the object's relevance to a query 108. For the purposes of the graphical model 200, it may be assumed that the observed label ỹ is noisy.
As shown in FIG. 2, the observed label ỹ is conditionally dependent on the actual label y.
Other example elements of the graphical model 200 include the parameter x, which represents one or more observable features of the object 106 (i.e., object features). In various embodiments, the parameter x is flexible, meaning that the model is not dependent on any specific set of features of the object 106. In an embodiment, the parameter x represents aspects of the object 106 that determine its relevance. For example, in some embodiments, x is a feature vector representing observable relevancy features of the object 106, such as, the number of times the object 106 has been accessed in a specified time frame (e.g., access of a web page, etc.), the number of times a term or component (such as a word or phrase, for example) is found in an object 106, the frequency that the object 106 is updated (e.g., updates to a web page, etc.), and the like.
As shown in FIG. 2, the parameter ω represents the weights of the one or more object features x.
In one embodiment, the graphical model 200 describes a joint probability distribution of the actual label y and the observed label ỹ, given one or more features x of the object 106. The joint probability distribution may include a conditional probability of the actual label y given the one or more features x of the object 106, and considering the weight parameter ω of the one or more features x. The conditional probability may be described using equations shown in a later section.
The graphical model 200 may be applied as part of an example technique to learn ranking with noisy training data. For example, techniques may be used with reference to a query q (not shown). In an embodiment, $n_q$ denotes the number of documents associated with query q, d denotes the number of document features, and k denotes the number of possible relevance labels. Additionally, $(x^q, \tilde{y}^q)$ may be used to denote the data associated with query q in a training set, where $x^q$ is an $n_q \times d$ matrix with the i-th row $x_i^q$ representing the feature vector of the i-th object 106, and $\tilde{y}^q \in \{0, 1, \dots, k-1\}^{n_q}$ is the vector of observed relevance labels of those objects.
As discussed above, it may be assumed that labels assigned to or annotated to objects 106 contain noise. Accordingly, the hidden element $y^q \in \{0, 1, \dots, k-1\}^{n_q}$ is introduced to represent the vector of actual relevance labels of the objects 106, and the joint probability of the actual and observed labels is decomposed into a ranking component and a noise component.
The aforementioned decomposition can be written:
$P(y^q, \tilde{y}^q \mid x^q; \omega, \gamma^q) = P(y^q \mid x^q; \omega)\, P(\tilde{y}^q \mid y^q; \gamma^q)$.  Eqn. (1)
Then, the likelihood of the training data $S = \{(x^q, \tilde{y}^q)\}_{q=1}^{m}$ may be written:

$L(\omega, \gamma) = \sum_{q=1}^{m} \log \sum_{y^q} P(y^q \mid x^q; \omega)\, P(\tilde{y}^q \mid y^q; \gamma^q)$,  Eqn. (2)

where $L(\omega, \gamma)$ represents the log likelihood of the parameters ω and γ.
The two conditional probabilities (A and B) incorporated into equation (2) are defined in the following subsections. For ease of discussion, the superscript q is implied on the terms in the remainder of this section, but is not written.
Conditional Probability A: $P(y \mid x; \omega)$
In one embodiment, the first conditional probability $P(y \mid x; \omega)$ is defined using a conditional random field (CRF) according to the equation:

$P(y \mid x; \omega) = \frac{1}{Z(x)} \exp\Big\{\sum_i \sum_j \omega^T (x_i - x_j)\, I(y_i > y_j)\Big\}$,  Eqn. (3)

where $I(\cdot)$ is the indicator function, and

$Z(x) = \sum_y \exp\Big\{\sum_i \sum_j \omega^T (x_i - x_j)\, I(y_i > y_j)\Big\}$.
Each object's features $x_i$ are mapped to a score using the parameter ω, and the scores of the objects 106 are then checked for consistency with their actual relevance labels. For example, the consistency may be measured by checking every pair of objects 106 with $y_i > y_j$ (where $y_i$ is the actual relevance label of the i-th object and $y_j$ is the actual relevance label of the j-th object) to determine whether the score of the first object 106 is larger than that of the second. The larger the score difference in the consistent direction, the higher the probability $P(y \mid x)$.
Thus, by using the above formulation, the feature functions in the CRF are defined as pairwise comparisons between two different objects 106.
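For illustration only, the pairwise CRF of equation (3) can be evaluated by brute force when the number of objects is small. The following is a minimal sketch of this computation, not an implementation of the claimed subject matter; it assumes NumPy, the function names are hypothetical, and the full enumeration of the label space is for exposition only, since it grows exponentially with the number of objects.

```python
import itertools

import numpy as np

def crf_score(y, x, w):
    """Unnormalized log-score of Eqn. (3): the sum of w^T (x_i - x_j)
    over every ordered pair (i, j) with y_i > y_j."""
    s = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                s += w @ (x[i] - x[j])
    return s

def crf_prob(y, x, w, k):
    """P(y | x; w) of Eqn. (3), with Z(x) computed by enumerating all
    k^n label vectors (feasible only for small n)."""
    n = len(y)
    z = sum(np.exp(crf_score(v, x, w))
            for v in itertools.product(range(k), repeat=n))
    return np.exp(crf_score(y, x, w)) / z

# Toy check: the probabilities over the whole label space sum to 1.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))   # 3 objects, 4 features each
w = rng.normal(size=4)        # feature weights
total = sum(crf_prob(v, x, w, k=2)
            for v in itertools.product(range(2), repeat=3))
print(round(total, 6))        # -> 1.0
```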
Conditional Probability B: $P(\tilde{y} \mid y; \gamma)$
In an embodiment, the second probability $P(\tilde{y} \mid y; \gamma)$ is defined based on a multinomial noise model. First, given the actual label y, the noisy label ỹ is assumed to be independent of the object features x, but not independent of the query q. The noisy label ỹ is dependent on the query q because it depends on the parameter γ, which is query specific. In this way, the graphical model 200 can reflect that some queries may be more likely to be judged (i.e., annotated, labeled, etc.) mistakenly, as discussed above. The probability may be first defined as:
$P(\tilde{y} \mid y; \gamma) = \prod_i P(\tilde{y}_i \mid y_i; \gamma)$.  Eqn. (4)
Second, for a query q, it is assumed that each of the resulting objects 106 is correctly labeled with probability $1 - \gamma$ and incorrectly labeled with probability γ, with each of the $k - 1$ incorrect labels being equally likely. Then, $P(\tilde{y}_i \mid y_i; \gamma)$ can be represented as:

$P(\tilde{y}_i \mid y_i; \gamma) = (1 - \gamma)^{I(\tilde{y}_i = y_i)} \left(\tfrac{\gamma}{k-1}\right)^{I(\tilde{y}_i \neq y_i)}$.  Eqn. (5)
Combining equations (4) and (5) results in the equation:

$P(\tilde{y} \mid y; \gamma) = \prod_i (1 - \gamma)^{I(\tilde{y}_i = y_i)} \left(\tfrac{\gamma}{k-1}\right)^{I(\tilde{y}_i \neq y_i)}$.  Eqn. (6)
As shown above, in the described embodiment, a query-dependent multinomial distribution (i.e., the parameters γ are different for different queries) may be used to define the second conditional probability.
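The multinomial noise model of equations (4) through (6) is straightforward to evaluate. The sketch below is illustrative only (the names are hypothetical and NumPy is assumed):

```python
import numpy as np

def noise_prob(y_obs, y_true, gamma, k):
    """P(~y | y; gamma) of Eqn. (6): each label is kept with probability
    1 - gamma and flipped to one of the other k - 1 labels, each with
    probability gamma / (k - 1)."""
    y_obs = np.asarray(y_obs)
    y_true = np.asarray(y_true)
    per_label = np.where(y_obs == y_true, 1.0 - gamma, gamma / (k - 1))
    return float(per_label.prod())

# One mismatch out of three labels, with k = 3 possible labels:
print(noise_prob([1, 0, 2], [1, 0, 0], gamma=0.1, k=3))  # 0.9 * 0.9 * 0.05
```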
In various embodiments, a learning algorithm is used to learn and infer elements of the graphical model 200. Given a set of training data $S = \{(x^q, \tilde{y}^q)\}_{q=1}^{m}$, the parameters ω and γ of the graphical model 200 can be learned by maximum likelihood estimation. Then, the parameter ω can be used to rank the objects 106 for a query q.
In one embodiment, one or more of the model parameters ω and γ of the graphical model 200 may be learned by maximizing the log likelihood (see equation (2)) of the training data. An example learning objective may be expressed as:

$(\omega^*, \gamma^*) = \arg\max_{\omega, \gamma} L(\omega, \gamma)$.  Eqn. (7)
In one embodiment, maximizing a log likelihood of the set of training data includes iterating an expectation maximization (EM) technique on the set of training data until the iterations converge. In one implementation, the EM technique iterates between an E (expectation) step and an M (maximization) step. For example, the maximizing may include iteratively performing operations of: estimating an expected value of the log likelihood of the training data, with respect to the probability of the relevance of the document, given feature vectors of the document, a proposed relevance of the document, and an estimate of the modeling parameter (E step); and selecting a modeling parameter that maximizes the expected value of the log likelihood (M step).
In one implementation, the E step includes estimating the expected value of the log-likelihood of the complete data, $\log P(y^q, \tilde{y}^q \mid x^q; \omega, \gamma^q)$, with respect to the probability of the hidden variable $y^q$, given the observation $(\tilde{y}^q, x^q)$ and the current parameter estimates $(\omega^t, \gamma^{q,t})$ (estimated in the t-th iteration). When the expectation function is denoted as $T(\omega, \gamma \mid \omega^t, \gamma^t)$, the expected log-likelihood expression may be written as:

$T(\omega, \gamma \mid \omega^t, \gamma^t) = \sum_q \sum_{y^q} P(y^q \mid \tilde{y}^q, x^q; \omega^t, \gamma^{q,t}) \log P(y^q, \tilde{y}^q \mid x^q; \omega, \gamma^q)$.  Eqn. (8)
Substituting equation (1) into equation (8) results in:

$T(\omega, \gamma \mid \omega^t, \gamma^t) = T_1(\omega) + T_2(\gamma)$,  Eqn. (9)
where:

$T_1(\omega) = \sum_q \sum_{y^q} P(y^q \mid \tilde{y}^q, x^q; \omega^t, \gamma^{q,t}) \log P(y^q \mid x^q; \omega)$,  Eqn. (10)

$T_2(\gamma) = \sum_q \sum_{y^q} P(y^q \mid \tilde{y}^q, x^q; \omega^t, \gamma^{q,t}) \log P(\tilde{y}^q \mid y^q; \gamma^q)$,  Eqn. (11)

and the posterior of the hidden labels follows from Bayes' rule:

$P(y^q \mid \tilde{y}^q, x^q; \omega^t, \gamma^{q,t}) = \frac{P(y^q \mid x^q; \omega^t)\, P(\tilde{y}^q \mid y^q; \gamma^{q,t})}{\sum_{z^q} P(z^q \mid x^q; \omega^t)\, P(\tilde{y}^q \mid z^q; \gamma^{q,t})}$.  Eqn. (12)
In an embodiment, the sums in equations (10), (11), and (12) range over the entire $y^q$ space, which contains $k^{n_q}$ possible label vectors.
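For small $n_q$, the posterior of equation (12) can nevertheless be evaluated by direct enumeration. A minimal sketch follows (names are hypothetical); note that the normalizer $Z(x)$ appears in both the numerator and the denominator of equation (12) and therefore cancels, so only unnormalized CRF scores are needed:

```python
import itertools

import numpy as np

def crf_unnorm(y, x, w):
    # Numerator of Eqn. (3): exp of the pairwise score of label vector y.
    s = sum(w @ (x[i] - x[j])
            for i in range(len(y)) for j in range(len(y)) if y[i] > y[j])
    return np.exp(s)

def noise_prob(y_obs, y, gamma, k):
    # Eqn. (6): per-label keep/flip probabilities.
    match = np.asarray(y_obs) == np.asarray(y)
    return float(np.where(match, 1.0 - gamma, gamma / (k - 1)).prod())

def posterior(y_obs, x, w, gamma, k):
    """P(y | ~y, x; w, gamma) of Eqn. (12) for every label vector y,
    by enumerating the k^n-element label space."""
    space = list(itertools.product(range(k), repeat=len(y_obs)))
    joint = np.array([crf_unnorm(v, x, w) * noise_prob(y_obs, v, gamma, k)
                      for v in space])
    return space, joint / joint.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
w = rng.normal(size=4)
space, post = posterior([1, 0, 1], x, w, gamma=0.2, k=2)
print(space[int(np.argmax(post))])  # most probable actual label vector
```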
In one implementation, the M step includes choosing the parameters that maximize the expectation computed in the E step:
$(\omega^{t+1}, \gamma^{t+1}) = \arg\max_{\omega, \gamma} T(\omega, \gamma \mid \omega^t, \gamma^t)$.  Eqn. (15)
Combining equations (9), (11), and (15) results in a closed-form update for the noise parameter:

$\gamma^{q,t+1} = \frac{1}{n_q} \sum_{y^q} P(y^q \mid \tilde{y}^q, x^q; \omega^t, \gamma^{q,t}) \sum_{i=1}^{n_q} I(\tilde{y}_i^q \neq y_i^q)$.  Eqn. (16)
In an implementation, $T(\omega, \gamma \mid \omega^t, \gamma^t)$ is concave with respect to ω. In such an implementation, a gradient ascent approach may be used to update the parameter ω.
In various embodiments, when the E step and M step iterations converge, estimates of the parameters ω and $\gamma^q$ are obtained. The parameter $\gamma^q$ can indicate the level of noise for the training query q, and the parameter ω can be used to perform ranking on new queries.
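Putting the pieces together, the sketch below illustrates the full EM loop on toy-sized queries. It is an enumeration-based illustration of the procedure described above, not a production implementation: the names are hypothetical, the M step for ω takes a single gradient-ascent step per EM iteration, and the exhaustive label-space enumeration restricts it to very small $n_q$.

```python
import itertools

import numpy as np

def pair_features(y, x):
    # phi(y, x): sum of (x_i - x_j) over pairs with y_i > y_j, so that
    # the CRF score of Eqn. (3) is w^T phi(y, x).
    phi = np.zeros(x.shape[1])
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                phi += x[i] - x[j]
    return phi

def crf_dist(x, w, k):
    # P(y | x; w) of Eqn. (3) over the full label space (brute force).
    space = list(itertools.product(range(k), repeat=x.shape[0]))
    scores = np.array([w @ pair_features(v, x) for v in space])
    p = np.exp(scores - scores.max())
    return space, p / p.sum()

def noise_prob(y_obs, y, gamma, k):
    # Eqn. (6).
    match = np.asarray(y_obs) == np.asarray(y)
    return np.where(match, 1.0 - gamma, gamma / (k - 1)).prod()

def em(data, k, iters=50, lr=0.1):
    """data: list of (x_q, y_obs_q) pairs, one per training query.
    Returns shared ranking weights w and one noise rate gamma per query."""
    d = data[0][0].shape[1]
    w = np.zeros(d)
    gammas = [0.1] * len(data)
    for _ in range(iters):
        grad_w = np.zeros(d)
        new_gammas = []
        for (x, y_obs), gamma in zip(data, gammas):
            space, p_crf = crf_dist(x, w, k)
            # E step: posterior of Eqn. (12); Z(x) cancels in the ratio.
            post = p_crf * np.array([noise_prob(y_obs, v, gamma, k)
                                     for v in space])
            post /= post.sum()
            # M step for w: gradient of T1 in Eqn. (10) -- expected pairwise
            # features under the posterior minus those under the CRF itself.
            phis = np.array([pair_features(v, x) for v in space])
            grad_w += post @ phis - p_crf @ phis
            # M step for gamma: the closed-form update of Eqn. (16).
            mism = np.array([sum(a != b for a, b in zip(v, y_obs))
                             for v in space])
            new_gammas.append(float(post @ mism) / len(y_obs))
        w += lr * grad_w  # one gradient-ascent step on T1
        gammas = new_gammas
    return w, gammas
```

Because the label space grows as $k^{n_q}$, a practical implementation would replace the exhaustive sums with an approximation; the sketch is meant only to make the E and M steps concrete.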
With one or more parameters of the graphical model 200 determined, objects 106 resulting from a new query may be ranked for relevance to the query. Given a new query, the actual relevance label y is inferred for its objects by maximizing $P(y \mid x; \omega)$. The inferred label may be denoted as $y^* = \arg\max_y P(y \mid x; \omega)$. Then, the objects 106 are sorted according to their actual labels. In some embodiments, there may be multiple label vectors $y^*$ that maximize the probability $P(y \mid x; \omega)$. In such cases, $S^*$ can be used to denote the set of maximizing label vectors. This may be expressed as:
$P(y \mid x; \omega) > P(z \mid x; \omega), \quad \forall y \in S^*, z \notin S^*$.  Eqn. (17)
In one embodiment, the inference process discussed above includes sorting the objects 106 in descending order of their scores $\omega^T x$. This produces a ranked list of the objects 106 that is consistent with the set of actual relevance labels $S^*$. This result is described by the theorem: Suppose $\pi^*$ is the permutation according to the descending order of $\omega^T x$; then $\pi^*$ is consistent with $S^*$.
For the purposes of this application, the definition of consistency as it applies to the above theorem is given as follows. Suppose that π is a permutation, $\pi(i)$ denotes the position of the i-th object, and $S \subseteq \{0, 1, \dots, k-1\}^n$ is a set of label vectors. Then π is consistent with S if

$\pi(i) < \pi(j), \quad \forall y_i > y_j, \forall y \in S$,  Eqn. (18)

where $\pi(i) < \pi(j)$ means the i-th object is ranked before the j-th object.
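In code, this inference step reduces to scoring and sorting. A minimal sketch, assuming NumPy and hypothetical names:

```python
import numpy as np

def rank_objects(x, w):
    """Score each object with w^T x_i and return object indices in
    descending score order -- the permutation pi* of the theorem above."""
    scores = x @ w
    return np.argsort(-scores)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))   # 5 retrieved objects, 4 features each
w = rng.normal(size=4)        # weights learned as described above
print(rank_objects(x, w))     # object indices, most relevant first
```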
At block 302, a system or device receives a set of training data, which includes one or more objects. In one example, the system or device may be configured as system 102 and the one or more objects may be configured as objects 106A-106N, as seen in FIG. 1.
At block 304, a modeling parameter for a graphical model (such as graphical model 200, for example) may be learned. In various embodiments, one or more modeling parameters are learned for the graphical model. Modeling parameters may include feature vectors of the objects (such as features x), weights of features of the objects (such as weights ω), noise parameters (such as noise parameter γ), or other parameters. For example, in one implementation, a modeling parameter represents a degree of noise in a proposed relevance of a document, where the modeling parameter is dependent on a query associated with the document. In alternate embodiments, some modeling parameters may be observable, and others may be hidden or initially hidden.
In one example, the method may include learning hidden or initially hidden modeling parameters for the graphical model by maximizing a log likelihood of the set of training data. The maximizing may include iterating an expectation maximization (EM) technique on the set of training data until the iterations converge. For instance, the EM technique may include iteratively performing operations of: estimating an expected value of the log likelihood of the training data with respect to a probability of the relevance of the document, given feature vectors of the document, a proposed relevance of the document, and an estimate of the modeling parameter; and selecting a modeling parameter that maximizes the expected value of the log likelihood. When the iterations converge, the resulting modeling parameter can be used in the graphical model.
In another example, the method may include updating one or more of the modeling parameters using a gradient ascent technique. As shown in Eqn. (7), the log likelihood $L(\omega, \gamma)$ is maximized. To use an example gradient ascent technique, the gradients of $L(\omega, \gamma)$ with respect to the parameters ω and γ are first computed.
The parameters are first randomly initialized. Supposing the initial parameters are $\omega^0$ and $\gamma^0$, the parameters are then iteratively updated, for example with an update loop such as the sketch below.
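Because the analytic gradients depend on the model details, the following sketch keeps the update loop generic: it treats the log likelihood of equation (2) as a black-box callable and estimates the gradients numerically by central finite differences (an analytic gradient would replace this in practice). All names and the convergence test are illustrative.

```python
import numpy as np

def gradient_ascent(log_likelihood, w0, gamma0, lr=0.01, iters=1000, eps=1e-5):
    """Iteratively step w and gamma uphill on log_likelihood(w, gamma)
    until the parameter updates fall below eps."""
    w = np.array(w0, dtype=float)
    gamma = np.array(gamma0, dtype=float)
    for _ in range(iters):
        # Finite-difference gradient in w.
        grad_w = np.zeros_like(w)
        for i in range(len(w)):
            dw = np.zeros_like(w)
            dw[i] = eps
            grad_w[i] = (log_likelihood(w + dw, gamma)
                         - log_likelihood(w - dw, gamma)) / (2 * eps)
        # Finite-difference gradient in gamma (one noise rate per query).
        grad_g = np.zeros_like(gamma)
        for q in range(len(gamma)):
            dg = np.zeros_like(gamma)
            dg[q] = eps
            grad_g[q] = (log_likelihood(w, gamma + dg)
                         - log_likelihood(w, gamma - dg)) / (2 * eps)
        w_new = w + lr * grad_w
        gamma_new = np.clip(gamma + lr * grad_g, 1e-6, 1 - 1e-6)  # keep in (0, 1)
        if max(np.abs(w_new - w).max(), np.abs(gamma_new - gamma).max()) < eps:
            break
        w, gamma = w_new, gamma_new
    return w, gamma
```

Here, log_likelihood could be built from the enumeration sketches above, computing equation (2) directly for toy-sized data.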
At block 306, noise in the training data is modeled with the graphical model. In one embodiment, the method includes using the modeling parameter(s) (e.g., features, weights, etc.) from block 304 to model the noise in the training data.
At block 308, a ranking function for the training data is modeled using the graphical model. In one embodiment, the model of the noise in the training data is separate and independent from the model of the ranking function for the training data. In an alternate embodiment, the models for the noise and the ranking function are integrated into the same graphical model.
In various embodiments, the graphical model may be configured to capture (1) a conditional dependency of an actual label of the document on the features of the document, and (2) a conditional dependency of an observed label of the document on the actual label of the document. For example, the graphical model is configured to distinguish the actual label of the document from the observed label of the document, where the graphical model is configured to model noise based on the query.
At block 310, a relevance of the object is determined. In the example of FIG. 3, the relevance is determined using input from the graphical model, based on the learned modeling parameter(s).
In one embodiment, the method includes receiving a new relevance query, for example, from a user. The query may include a search query, for instance. In an implementation, the relevance of the object(s) returned from the query is determined based on the graphical model and the query.
In one embodiment, the method may include extracting features from the objects that are the result of a query to improve the relevance determination. For example, the extracted features may include the number of times a term or phrase appears within a document, the number of visits or “hits” a document accumulates within a time frame, the frequency that the document (or file, etc.) is updated, and the like.
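By way of illustration only, a feature extractor along these lines might look like the following sketch; the specific features, signature, and names are hypothetical examples rather than a prescribed feature set:

```python
import re

def extract_features(document_text, query_terms, hits_in_window,
                     updates_in_window):
    """Build a small feature vector for one document (illustrative only)."""
    tokens = re.findall(r"\w+", document_text.lower())
    term_count = sum(tokens.count(t.lower()) for t in query_terms)
    return [
        term_count,                        # raw query-term occurrences
        term_count / max(len(tokens), 1),  # normalized term frequency
        hits_in_window,                    # accesses within a time frame
        updates_in_window,                 # update frequency
    ]

print(extract_features("Machine learning ranks documents. Learning helps.",
                       ["machine", "learning"],
                       hits_in_window=120, updates_in_window=3))
```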
At block 312, the determined relevance label (such as actual label y) may be associated with the object and output to one or more users. In alternate embodiments, the output may be in various electronic or hard-copy forms. For example, in one embodiment, the output is a searchable, annotated database that includes relevance ranking of the objects for ease of browsing, searching, and the like.
Although implementations have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts are disclosed as illustrative forms of illustrative implementations. For example, the methodological acts need not be performed in the order or combinations described herein, and may be performed in any combination of one or more acts.