1. Technical Field
The present invention relates to semantic indexing, and, more particularly, to reducing ranking errors in semantic indexing systems and methods.
2. Description of the Related Art
Supervised Semantic Indexing (SSI) models are trained using a set of queries and documents regarded as good matches for the queries. There are several practical challenges that arise when applying this scheme. In particular, there are many sources of ranking errors that can affect the performance of the model. For example, two substantial problems that can cause ranking errors are a lack of training data and changes in the distribution of queries over time. Here, a lack of training data can cause the model to overfit the data. In addition, changes in query distributions may render the SSI model obsolete for new data.
One embodiment of the present invention is directed to a method for training a semantic indexing model. In accordance with the method, a search engine is provided with a first query. In addition, a set of documents of a plurality of documents related to the first query is received from the search engine. Further, an expanded query is generated by merging at least a portion of a subset of the set of documents with the first query. Additionally, the semantic indexing model is trained based on the expanded query.
Another embodiment of the present invention is directed to a method for incorporating a time-based measure in a semantic indexing model. In accordance with the method, a query is received. At least one time difference parameter denoting a time difference between receipt of the query and a generation of at least one document of a plurality of documents is determined. In addition, a similarity measure is modified based on the time difference parameter(s). Further, at least a subset of the plurality of documents is ranked based on the modified similarity measure.
Another embodiment of the present invention is directed to a system for training a semantic indexing model. The system includes a search engine, a query generator unit and a controller. The search engine is configured to receive a first query and generate a set of documents of a plurality of documents related to the first query. In addition, the query generator unit is configured to generate an expanded query by merging at least a portion of at least a subset of the set of documents with the first query. Further, the controller is configured to train the semantic indexing model based on the expanded query.
These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
Exemplary embodiments of the present invention described herein improve SSI ranking methods and systems by compensating for a lack of training data and implementing time difference features to address changes in the distribution of queries over time. To compensate for a lack of training data, query terms can be expanded with the top N relevant documents/items of a search engine, and SSI models can be trained using these expanded query vectors. Here, when expanding query terms, normalization may be applied. To address shifting of queries over time, a time feature is introduced. In particular, the time feature can denote the difference between the time the query is generated and the time when a document is generated. In preferred embodiments, this time feature can be used in training and testing to bias newer documents over older documents.
It should be understood that embodiments described herein may be entirely hardware or may include both hardware and software elements, which include but are not limited to firmware, resident software, microcode, etc. In a preferred embodiment, the present invention is implemented in hardware and software.
Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer-readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be a magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.
A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
Reference in the specification to “one embodiment” or “an embodiment” of the present principles, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment”, as well as any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
It is to be appreciated that the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of “A, B, and/or C” and “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as readily apparent by one of ordinary skill in this and related arts, for as many items listed.
Referring now to the drawings in which like numerals represent the same or similar elements and initially to
With reference to
where q is a word vector of the query q, d+ is a word vector of relevant documents d+, d− is a word vector of irrelevant documents d−, and ƒ is a similarity function between a query and documents. A word vector, such as q, d+ or d−, is a vector of dimension D, where D is the size of a vocabulary. Each word (term) is assigned a fixed position in this vector, and the value at that position is the weight of the word in that text entity, which is either a query or a document. This representation is a “vector space model” in which the similarity of two text entities can be calculated as the dot product of their word vectors. Here, the text entities are denoted as underlined letters, such as q for query or d for document, and their word vectors are denoted as italic letters, such as q for query q or d for d. The similarity function ƒ can be a low rank linear model on pairwise features, among other functions, and can be determined by solving the optimization problem of equation (1). In accordance with exemplary embodiments, the similarity function can be ƒ(q,d)=q^T U^T U d, ƒ(q,d)=q^T U^T V d, or other functions. Thus, the system ranks documents based on scores provided by the similarity function ƒ between words of a query q and a given document d, where the documents d with the highest scores for the query q are given the highest rankings. As discussed further herein below, the query feature q, which is orthogonal to exploring similarity measures, is modeled and expanded. In particular, the preferred embodiments of the ranker 112 rank documents for a query q′ based on the similarity function ƒ, where ƒ is determined such that the following loss is minimized:
In accordance with one exemplary aspect, the controller 102 can solve the optimization problem (2) by applying a Stochastic Gradient Descent on {(q, d+, d−)}, or on its subset.
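The stochastic gradient descent training described above can be sketched as follows. This is only an illustrative sketch, not the claimed implementation: it assumes the similarity function ƒ(q,d)=q^T U^T V d and a margin ranking (hinge) loss of the form max(0, 1 − ƒ(q,d+) + ƒ(q,d−)), which is an assumption because the loss expression is not reproduced in this text. The function names, the learning rate, and the margin value are hypothetical.

```python
import numpy as np

def score(U, V, q, d):
    """Low rank similarity f(q, d) = q^T U^T V d."""
    return float((U @ q) @ (V @ d))

def sgd_step(U, V, q, d_pos, d_neg, lr=0.1, margin=1.0):
    """One stochastic gradient step on a single (q, d+, d-) triplet.

    ASSUMPTION: hinge loss max(0, margin - f(q, d+) + f(q, d-)).
    U and V are updated in place; the loss before the update is returned.
    """
    loss = margin - score(U, V, q, d_pos) + score(U, V, q, d_neg)
    if loss <= 0.0:
        return 0.0  # margin already satisfied, no update needed
    diff = d_pos - d_neg
    uq = U @ q          # query projected into the low rank space
    vdiff = V @ diff    # document difference in the low rank space
    # f(q, d+) - f(q, d-) = (U q) . (V diff); ascend its gradient.
    grad_U = np.outer(vdiff, q)
    grad_V = np.outer(uq, diff)
    U += lr * grad_U
    V += lr * grad_V
    return loss
```

Repeating `sgd_step` over triplets drawn at random from {(q, d+, d−)}, or from a subset of it, would drive the score of each relevant document above that of each irrelevant one by at least the margin.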
To generate and apply the expanded query q′, the method 200 can begin at step 202, at which the system 100 can receive a query q through the user-interface 106 from a user and can provide the query q to the search engine 108.
At step 204, the query generator 114 can receive a set of documents related to the query q. For example, the search engine 108 can apply an existing searching algorithm to obtain a set of documents that are relevant to the query q. Here, the query generator 114 can select a set S of the top k documents {d1, d2, . . . , dk}, where k is a pre-defined parameter (e.g., k=5).
At step 206, the query generator unit 114 can generate a new query q′ by merging at least a subset S of the received set of documents with the query q. For example, the query generator 114 can merge words in S with q to generate the new q′. The merging can be implemented by merging the text of query q with the text of documents in S and calculating the weight vector on the resulting text as q′. Alternatively, the query generator 114 can calculate the word weights separately on q and S, average the weights, and then set q′ as the average. In each of these cases, to calculate the word weights, the query generator 114 can use a binary representation, where, for example, the component of the vector is populated with a 1 when the word occurs, 0 otherwise. In addition, the query generator 114 can utilize term frequency (TF), term frequency-inverse document frequency (TF-IDF), OKAPI BM25, etc. Other methods for generating a new query q′ by merging the set S of documents with the query q can also be employed.
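The two merging options described above can be sketched as follows. This is a minimal sketch under stated assumptions: whitespace tokenization, raw term frequency or binary weights only, and an equal-weight average for the second option. All function names are hypothetical.

```python
from collections import Counter

def term_weights(text, binary=False):
    """Bag-of-words weights: raw term frequency, or 1.0 per present word if binary."""
    counts = Counter(text.lower().split())
    if binary:
        return {w: 1.0 for w in counts}
    return {w: float(c) for w, c in counts.items()}

def expand_by_concatenation(query, top_docs, binary=False):
    """Merge the text of q with the text of the documents in S, then weight the result."""
    merged = " ".join([query] + list(top_docs))
    return term_weights(merged, binary)

def expand_by_averaging(query, top_docs, binary=False):
    """Weight q and S separately, then average the two weight vectors.

    ASSUMPTION: an unweighted mean of the two vectors.
    """
    wq = term_weights(query, binary)
    ws = term_weights(" ".join(top_docs), binary)
    return {w: 0.5 * (wq.get(w, 0.0) + ws.get(w, 0.0)) for w in set(wq) | set(ws)}
```

A TF-IDF or OKAPI BM25 weighting could be substituted for `term_weights` without changing either merging function.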
At step 208, the controller 102 can train the SSI model ƒ based on at least one of the documents and the expanded query q′. For example,
After the system 100 is trained, at least to some degree, in accordance with the method 200, the system 100 can perform the method 250 of
At step 210, the ranker 112 can rank documents for the query q′ in accordance with the trained model ƒ. For example, to better illustrate how step 210 can be implemented, reference is made to
The method can also proceed to step 214, where the controller 102 can continue training the model as discussed above with respect to step 208 of the method 200. For example, after the ranked set S′ is output to the user, the controller 102 can monitor the documents that were clicked by the user and also the documents that were presented to the user and not clicked by the user to update the parameters U, V of the model ƒ, as discussed above with respect to step 208. Thereafter, the method 200 can repeat with a different query q entered by the user.
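One way the click feedback described above could be turned into training data is sketched below. It assumes that clicked documents are treated as relevant documents d+ and that presented but unclicked documents are treated as irrelevant documents d−, which is an interpretation rather than something stated in this text. The function and parameter names are hypothetical.

```python
from itertools import product

def triplets_from_clicks(query_vec, shown, clicked_ids):
    """Build (q, d+, d-) triplets from one presented result list.

    `shown` maps document id -> document vector for every result presented;
    `clicked_ids` is the set of ids the user clicked.
    ASSUMPTION: clicked = relevant (d+), shown but not clicked = irrelevant (d-).
    """
    positives = [v for doc_id, v in shown.items() if doc_id in clicked_ids]
    negatives = [v for doc_id, v in shown.items() if doc_id not in clicked_ids]
    return [(query_vec, d_pos, d_neg) for d_pos, d_neg in product(positives, negatives)]
```

The resulting triplets could then be used for further parameter updates in the manner of step 208.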
It should be noted that the methods 200 and 250 can be implemented in a variety of ways. For example, with regard to step 204, the top returned items {d1, d2, . . . , dk} that are merged with the query at step 206 can be obtained from a search engine, as discussed above. Alternatively, the top returned items can be selected by applying a cosine distance between the query vector q and the document vectors of various available stored documents, where the k document vectors with the smallest cosine distances are selected as the top returned items. Here, query vectors and document vectors can be calculated, for example, by TF-IDF, OKAPI BM25, or simply word counts. In accordance with another exemplary aspect, the top returned items can be selected by using the cosine distance between the query vectors and a low rank representation of documents, where the representations are obtained by applying singular value decomposition (SVD), principal component analysis (PCA), etc. to the documents. The k document representations with the smallest cosine distances are selected as the top returned items that are merged with the query q at step 206. In another embodiment, the top returned items can simply be single words that have the highest similarity with one of the query terms of the query q, where the merging comprises merging single words of the document vectors with query words of the query q that are similar to the document words. The similarity between terms can be calculated by co-occurrence based measures, such as Dice score, mutual information, etc. Alternatively, the similarity between terms can be calculated by cosine distance between embedding vectors of the words. Embeddings can be generated by factor analysis models like SVD, PCA or supervised embeddings.
Further, in accordance with other exemplary aspects of the present invention, the expansion procedure at step 206 can be performed by summing up the normalized TF-IDF weights for query terms and the terms in the top k documents, and normalizing the resulting vector to have norm 1. Alternatively, the query generator 114 can calculate the normalized TF-IDF on the concatenated text of the query and the top k documents.
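The cosine-based selection of the top k items can be sketched as follows. The sketch assumes that the query and the stored documents have already been converted to weight vectors of equal dimension (for example, word counts); the function name and the handling of all-zero vectors are assumptions made for illustration.

```python
import numpy as np

def top_k_by_cosine(q, docs, k=5):
    """Return the indices of the k rows of `docs` closest to `q` in cosine terms."""
    q = np.asarray(q, dtype=float)
    docs = np.asarray(docs, dtype=float)
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(q)
    # ASSUMPTION: an all-zero vector has no defined cosine, so it scores 0.
    sims = np.divide(docs @ q, denom, out=np.zeros(len(docs)), where=denom > 0)
    # Highest cosine similarity corresponds to smallest cosine distance.
    order = np.argsort(-sims, kind="stable")
    return order[:k].tolist()
```

The same function could be applied to low rank document representations, provided the query is projected into the same space first.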
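The summing procedure can be sketched as follows, assuming the query and each of the top k documents are already available as TF-IDF vectors over the same vocabulary and that "normalized" refers to the Euclidean norm. The function names are hypothetical.

```python
import numpy as np

def l2_normalize(v):
    """Scale v to Euclidean norm 1; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def expand_query_vector(q_tfidf, top_doc_tfidf):
    """Sum the normalized TF-IDF vectors of q and the top k documents, then renormalize."""
    total = l2_normalize(np.asarray(q_tfidf, dtype=float))
    for d in top_doc_tfidf:
        total = total + l2_normalize(np.asarray(d, dtype=float))
    return l2_normalize(total)
```

Normalizing each vector before summing keeps a long document from dominating the expanded query.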
As noted above, exemplary embodiments of the present invention can implement time difference features to address changes in the distribution of queries over time. This aspect is important for several reasons. For example, technical support documents for certain products lose value when these products become obsolete. Including these documents during training of models simply introduces noise and reduces the quality of rankings. In contrast, newly created documents are more likely to be reused because they are often associated with popular new products. Thus, a “time difference” feature should be introduced into the similarity function. Specifically, preferred embodiments of the present invention employ the variable TimeDiff(q,d)=time(q)−time(d), which is the difference between the time when the query q is generated, time(q), and the time when a document was generated/updated, time(d).
With reference now to
At step 504, the time difference module 116 can determine one or more time difference parameters denoting a time difference between the generation of the query and the generation of documents. For example, for each document d stored in the system, the time difference module 116 can determine the time difference TimeDiff(q,d)=time(q)−time(d). Here, the time of the generation of the query can be the time that the query is received by the system 100 and the time of the generation of a document can be the time that the document is first stored in the system 100 or the time at which the document was most recently updated.
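The time difference computation can be sketched as follows. Expressing the difference in days and preferring the most recent update over the creation time are assumptions made for illustration; the function names are hypothetical.

```python
from datetime import datetime

def document_time(created, updated=None):
    """time(d): the most recent update if one exists, else the time first stored."""
    return updated if updated is not None else created

def time_diff_days(query_time, doc_time):
    """TimeDiff(q, d) = time(q) - time(d), expressed in days (ASSUMED unit)."""
    return (query_time - doc_time).total_seconds() / 86400.0
```

For instance, a query received on Oct. 28, 2012 against a document last updated on Oct. 18, 2012 would yield a time difference of 10 days.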
At step 506, the similarity scoring module 110 can determine/modify a similarity measure based on the time difference parameter(s). For example, the variable could be used at step 506 as a reweighting factor on the original SSI score, i.e., ƒ(q,d)=ƒssi(q,d)*TimeDiff(q,d). For example, with reference to the diagram 600 of
or T(TimeDiff(q,d)) can be a logarithmic function that appropriately transforms TimeDiff(q,d). For example, the time difference parameter can be applied as follows, where TimeDiff′(q,d)=log(TimeDiff(q,d)) and where ft(q,d)=T(TimeDiff′(q,d)).
Alternatively, the time difference variable TimeDiff(q,d) could be employed and treated just as other words in the query q. Here, the time difference can be a special type of “word” that occupies its own entry in the vector q and the value of this word feature is the TimeDiff(q,d) with a specific d. Thus, the score ƒssi(q,d) can simply be determined with a special time difference “word” in q and a particular document d, where the value of the score ƒssi(q,d) in this case increases with a decreasing time difference TimeDiff(q,d). Here, the variable TimeDiff(q,d) can first be transformed by a logarithmic function and then used in place of TimeDiff(q,d).
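Treating the time difference as an extra "word" can be sketched as follows. The sketch assumes the time feature is appended as the last entry of the query vector and that the logarithm is the natural logarithm, which requires a strictly positive time difference; the function name is hypothetical.

```python
import math
import numpy as np

def query_with_time_feature(q, time_diff, use_log=True):
    """Append TimeDiff(q, d) for a specific d as one extra entry of the query vector."""
    if use_log:
        if time_diff <= 0:
            # ASSUMPTION: the natural log is used, so TimeDiff must be positive.
            raise ValueError("time_diff must be positive for the log transform")
        value = math.log(time_diff)
    else:
        value = float(time_diff)
    return np.append(np.asarray(q, dtype=float), value)
```

Because the extended query vector has D+1 entries, any parameter matrix applied to the query would need one additional column, whose weights are learned during training like those of any other word.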
At step 508, the ranker 112 can determine/rank documents based on the similarity measures f(q,d) as discussed above. For example, steps 204-210 can be performed as discussed above with respect to the method 250 of
As discussed above, the approaches described herein reduce ranking errors, which in turn produces higher performance in terms of other metrics, such as, for example, Mean Average Precision. Expanding query terms with the terms' top-ranked items makes the query vector larger. Further, the expanded query terms are relevant to the query terms to some extent. This expanded query term vector reduces the overfitting effect when the training data is limited, which is often true when training with word features. Moreover, training on time features incorporates a time-dependent factor of a document, which is important, as many documents may lose a searcher's interest after some time. The trained time feature will optimally reweight a document by how long it has existed at the time of the query, and reduce the likelihood that obsolete documents are presented to the searcher.
Referring now to
Having described preferred embodiments of methods and systems for query generation and implementation of time difference features for supervised semantic indexing (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.
This application claims priority to provisional application Ser. No. 61/719,474 filed on Oct. 28, 2012, incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
61719474 | Oct 2012 | US