Contemporary search engines are based on information retrieval technology, which finds and ranks relevant documents for a query, and then returns a ranked list. Many ranking models have been proposed in information retrieval; recently machine learning techniques have also been applied to constructing ranking models. However, existing methods do not take into consideration the fact that significant differences exist between types of queries.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which a query is processed, including to find documents for the query. The documents are ranked using a ranking model for the query that is selected/determined based upon the query. In one aspect, nearest neighbor concepts (of the query in query feature space) are used to determine/select the ranking model.
In one aspect, selection/determination of the ranking model is performed by training the ranking model online, based on a training set obtained from a number of nearest neighbors to the query. In an alternative aspect, selection/determination of the ranking model includes training a plurality of ranking models offline with a corresponding plurality of training sets, finding a most similar training set based on nearest neighbors of the query, and selecting as the ranking model the model that corresponds to the most similar training set. In another alternative aspect, selection/determination of the ranking model includes training a plurality of ranking models offline with a corresponding plurality of training sets, finding a nearest neighbor to the query, and selecting the ranking model that is associated with the training set that corresponds to the nearest neighbor of the query.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards employing different ranking models for different queries, which is referred to herein as “query-dependent ranking.” In one implementation, query-dependent ranking is based upon a K-Nearest Neighbor (KNN) method. In one implementation, an online method creates a ranking model for a given query by using the labeled neighbors of the query in query feature space, with the retrieved documents for the query then ranked by using the created model. Alternatively, offline approximations of KNN-based query-dependent ranking are used, which create the ranking models in advance to enhance the efficiency of ranking.
It should be understood that any of the examples described herein are non-limiting examples. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and query processing in general.
In general, training queries from a set of training data 102 are featurized in a known manner into a query feature space 104, as represented by the featurizer block 106. In other words, for each training query qi (with corresponding training data as Sq
When a new query 108 is processed, its features are similarly extracted (e.g., by the featurizer block 106) and used to locate one or more of its nearest neighbors, as represented by the block 110. As is readily understood, the query features that are used determine the accuracy of the process. While many ways to derive query features are feasible, one implementation used a heuristic method to derive query features, namely, for each query q, a reference model (e.g., BM25) is used to find its top T documents; note that the featurizer block 106 is also shown as incorporating the reference model. Once these are found, the process takes the mean of each feature value over the T documents as the corresponding feature of the query. For example, if a feature of the document is tf-idf (term frequency-inverse document frequency), then the corresponding query feature becomes the average tf-idf of the top T documents of the query. If there are many relevant documents, then it is very likely that the value of the average tf-idf is high.
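By way of example, and not limitation, the heuristic featurization described above may be sketched as follows. The function name `query_features`, the `reference_score` callable (standing in for a reference model such as BM25) and the sample data are hypothetical names chosen for illustration only; they are not part of the described implementation.

```python
from typing import Callable, List, Sequence


def query_features(
    docs: Sequence[Sequence[float]],
    reference_score: Callable[[Sequence[float]], float],
    top_t: int,
) -> List[float]:
    """Average the feature vectors of the top-T documents of a query.

    `reference_score` is an assumed stand-in for the reference model:
    it maps a document's feature vector to a relevance score.
    """
    if not docs:
        raise ValueError("query has no candidate documents")
    # Order the documents by the reference model's score, best first.
    ranked = sorted(docs, key=reference_score, reverse=True)
    top = ranked[:top_t]
    n_features = len(top[0])
    # Component-wise mean over the selected documents.
    return [sum(d[j] for d in top) / len(top) for j in range(n_features)]


# Hypothetical data: each document is [tf_idf, reference_model_score].
docs = [[0.75, 12.0], [0.125, 3.0], [0.5, 8.0], [0.25, 10.0]]
features = query_features(docs, reference_score=lambda d: d[1], top_t=2)
```

In this sketch the two highest-scoring documents are averaged, so the first component of `features` is the average tf-idf of the top T documents.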
To locate the nearest neighbors, given the new query 108, the k closest training queries to it in terms of Euclidean distance in feature space are found, as represented via block 112. The new query is also processed (e.g., as represented by block 114) to find relevant documents 116, which are unranked.
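One possible way to locate the k closest training queries by Euclidean distance is sketched below. The function name `k_nearest_queries` and the sample vectors are hypothetical; the sketch uses a straightforward scan over all training queries rather than any particular indexing structure.

```python
import heapq
import math
from typing import List, Sequence


def k_nearest_queries(
    query_vec: Sequence[float],
    training_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return the indices of the k training queries nearest to query_vec."""
    # Pair each Euclidean distance with the index of its training query.
    distances = (
        (math.dist(query_vec, vec), i) for i, vec in enumerate(training_vecs)
    )
    # Keep the k smallest distances, nearest first.
    return [i for _, i in heapq.nsmallest(k, distances)]


# Hypothetical query feature vectors for four training queries.
training_vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
neighbors = k_nearest_queries([0.0, 0.1], training_vecs, k=2)
```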
Unlike conventional ranking mechanisms that simply rank the documents, a local ranking model 118 is selected that depends on the query. In the online version, the local ranking model 118 is trained online using the neighboring training queries 112 (denoted as Nk(q)). In the offline versions, the local ranking models are trained in advance, with nearest neighbor concepts applied in selecting a local ranking model, e.g., based on a most similar training set, or based on the local ranking model associated with a nearest neighbor.
Once trained and/or selected, the local model 118 is used to rank the documents 116 of the new query, as represented by the ranked documents 120, which are returned in response to the query. As can be seen, in any alternative the overall process employs a k-nearest neighbor-based method for query-dependent ranking.
For training the local ranking model 118, any existing stable learning to rank algorithm may be used. One such algorithm that was implemented is Ranking SVM. Note that Sq
The online training process is referred to as “KNN Online”.
Example steps of a suitable KNN online algorithm are presented in the flow diagram of
As mentioned above, part of the online algorithm is able to use some offline pre-processing as represented by steps 304-306, namely for each training query qi, the reference model hr is used to find its top T documents, and its query features computed from the documents.
The online training and using of the local model is represented beginning at step 308, where the reference model hr is again used to find the top T documents, this time for the input query q, in order to compute its query features. Step 310 finds the k nearest neighbors of q, denoted as Nk(q) in the training data in the query feature space.
Given the nearest neighbors, at step 312 the training set
is used to learn a local model hq. Step 314 applies hq to the documents associated with the query q, and obtains the ranked list. Step 316 represents the output of the ranked list for the query q.
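The online steps above (finding the nearest neighbors, learning a local model from their training data, and applying that model) may be sketched as follows. This is an illustrative sketch only: a simple pairwise perceptron is swapped in as a stand-in for Ranking SVM, and the names `train_pairwise_linear` and `knn_online_rank`, the data layout and the sample data are all assumptions made for the example.

```python
import heapq
import math


def train_pairwise_linear(docs_per_query, epochs=50, lr=0.1):
    """Learn a linear scoring vector from preference pairs.

    Stand-in for Ranking SVM (an assumption of this sketch).
    docs_per_query: one list per training query of (features, label).
    Pairs are formed only between documents of the same query.
    """
    dim = len(docs_per_query[0][0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for docs in docs_per_query:
            for xa, ya in docs:
                for xb, yb in docs:
                    if ya <= yb:
                        continue  # keep only pairs where a should outrank b
                    diff = [a - b for a, b in zip(xa, xb)]
                    margin = sum(wi * di for wi, di in zip(w, diff))
                    if margin <= 0:  # mis-ordered pair: move w towards diff
                        w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def knn_online_rank(query_vec, query_docs, training_queries, k):
    """training_queries: list of (query_feature_vector, labeled_docs)."""
    # Find the k nearest training queries in query feature space.
    nearest = heapq.nsmallest(
        k, training_queries, key=lambda tq: math.dist(query_vec, tq[0])
    )
    # Learn the local model from the neighbors' training data.
    w = train_pairwise_linear([docs for _, docs in nearest])
    # Apply the local model to the documents of the query.
    return sorted(
        query_docs,
        key=lambda x: sum(wi * xi for wi, xi in zip(w, x)),
        reverse=True,
    )


# Hypothetical training data: two groups of queries whose relevant
# documents are characterized by different document features.
training_queries = [
    ([0.0, 0.0], [([1.0, 0.0], 1), ([0.0, 1.0], 0)]),
    ([0.1, 0.0], [([0.9, 0.2], 1), ([0.1, 0.8], 0)]),
    ([10.0, 10.0], [([0.0, 1.0], 1), ([1.0, 0.0], 0)]),
    ([10.0, 9.9], [([0.2, 0.9], 1), ([0.8, 0.1], 0)]),
]
candidates = [[0.0, 1.0], [1.0, 0.0]]
ranked = knn_online_rank([0.05, 0.0], candidates, training_queries, k=2)
```

In this sketch the same two candidate documents are ordered differently depending on which group of training queries the input query falls near, which is the query-dependent behavior described herein.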
As can be readily appreciated, the time complexity of the KNN Online algorithm is relatively high, with most of the computation time resulting from online model training and finding the k nearest neighbors. Model training is time consuming; for example, the time complexity of training a Ranking SVM model is of polynomial order in the number of document pairs. When finding the k nearest neighbors in the query feature space using a straightforward search algorithm, the time complexity is of order O(m log m), where m is the number of training queries.
To reduce the aforementioned time complexity, two alternative algorithms are described herein, which in general move the time-consuming steps offline. These alternative algorithms are referred to as KNN Offline-1 and KNN Offline-2.
KNN Offline-1 moves the model training step offline. In general, for each training query qi, its k nearest neighbors Nk(qi) are found in the query feature space. Then, a model hq
When testing, for a new query q, its k nearest neighbors Nk(q) are also found. Then, the algorithm compares SN
where |.| denotes the number of instances in a set.
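The comparison of training sets may be sketched as follows. Because the selection criterion is not reproduced in this text, the sketch assumes that similarity is measured by the number of instances two training sets have in common; the names `most_similar_training_set`, `neighbors_of` and `instance_count`, and the sample data, are hypothetical.

```python
from typing import Dict, List, Sequence


def most_similar_training_set(
    query_neighbors: Sequence[int],
    neighbors_of: List[Sequence[int]],
    instance_count: Dict[int, int],
) -> int:
    """Return the index i of the training query whose neighborhood
    training set shares the most instances with that of the new query.

    Assumption of this sketch: two training sets share exactly the
    instances of the training queries found in both neighbor lists.
    """
    target = set(query_neighbors)

    def shared_instances(i: int) -> int:
        common = target & set(neighbors_of[i])
        return sum(instance_count[j] for j in common)

    return max(range(len(neighbors_of)), key=shared_instances)


# Hypothetical neighbor lists, computed offline, for five training queries.
neighbors_of = [[0, 1, 2], [0, 1, 3], [1, 2, 3], [2, 3, 4], [0, 3, 4]]
instance_count = {0: 10, 1: 10, 2: 10, 3: 10, 4: 10}
best = most_similar_training_set([1, 2, 3], neighbors_of, instance_count)
```

The model trained offline for the training query at index `best` would then be used to rank the documents of the new query.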
Next, the model of the selected training set hq
Unlike the online algorithm, steps 508-510 are used to learn a local model offline. To this end, for each training query qi, step 509 finds the k nearest neighbors of qi, denoted as Nk(qi) in the training data in the query feature space, and uses the training set SN
The online operation of the Offline-1 algorithm is exemplified in
Then, step 606 finds the most similar training set SN
The KNN Offline-1 algorithm avoids online training; however, it introduces additional computation when searching for the most similar training set. Also, it still needs to find the k nearest neighbors of the test query online, which is also time-consuming. As online response time is a significant consideration for search engines, yet another alternative algorithm, referred to as KNN Offline-2, may be used to further reduce the time complexity.
A general idea in KNN Offline-2 is that instead of searching for the k nearest neighbors of the test query q, only its nearest neighbor in the query feature space is found. For example, if the nearest neighbor is qi*, only the model hq
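The single nearest neighbor lookup of KNN Offline-2 may be sketched as follows. The sketch assumes that each training query already has a model trained offline and that each model is a linear weight vector; the names `select_model_offline2` and `rank_with_model`, and the sample data, are hypothetical.

```python
import math
from typing import List, Sequence


def select_model_offline2(
    query_vec: Sequence[float],
    training_vecs: Sequence[Sequence[float]],
    models: Sequence[Sequence[float]],
) -> Sequence[float]:
    """Return the offline-trained model of the single nearest training query."""
    nearest = min(
        range(len(training_vecs)),
        key=lambda i: math.dist(query_vec, training_vecs[i]),
    )
    return models[nearest]


def rank_with_model(
    model: Sequence[float], docs: Sequence[Sequence[float]]
) -> List[Sequence[float]]:
    """Order documents by a linear score (the linear form is an assumption)."""
    return sorted(
        docs,
        key=lambda x: sum(w * xi for w, xi in zip(model, x)),
        reverse=True,
    )


# Hypothetical query feature vectors and their offline-trained models.
training_vecs = [[0.0, 0.0], [10.0, 10.0]]
models = [[1.0, -1.0], [-1.0, 1.0]]
model = select_model_offline2([9.0, 9.5], training_vecs, models)
ranked = rank_with_model(model, [[1.0, 0.0], [0.0, 1.0]])
```

Only one distance minimum is computed online in this sketch; no training set comparison and no model training occur at query time.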
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 910 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 910 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 910. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 930 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 931 and random access memory (RAM) 932. A basic input/output system 933 (BIOS), containing the basic routines that help to transfer information between elements within computer 910, such as during start-up, is typically stored in ROM 931. RAM 932 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 920. By way of example, and not limitation,
The computer 910 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 910 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 980. The remote computer 980 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 910, although only a memory storage device 981 has been illustrated in
When used in a LAN networking environment, the computer 910 is connected to the LAN 971 through a network interface or adapter 970. When used in a WAN networking environment, the computer 910 typically includes a modem 972 or other means for establishing communications over the WAN 973, such as the Internet. The modem 972, which may be internal or external, may be connected to the system bus 921 via the user input interface 960 or other appropriate mechanism. A wireless networking component 974 such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 910, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 999 (e.g., for auxiliary display of content) may be connected via the user interface 960 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 999 may be connected to the modem 972 and/or network interface 970 to allow communication between these systems while the main processing unit 920 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.