The following relates generally to interactive computerized search, and more specifically, to a computer-based system, method and computer program product for query clarification.
Modern search engines place an enormous amount of information within reach, but can offer disappointing results when users are unable to form their queries with sufficient specificity. Query disambiguation permits search engines to determine the context of a particular search in an effort to deliver relevant results.
One common method for query disambiguation is personalization, which builds a user profile that is then used to tailor results to the user's interests. Personalization can make use of a user's short-term (within session) and long-term (historic) browsing behaviour to disambiguate a user's query within a given session. For example, webpages previously or repeatedly visited by a user are often highly relevant to that user or to similar users. Other systems have considered information external to searches, such as email messages, calendar items, or documents on the user's device. Generally speaking, the richer the representation of the user, the better.
Query expansion is another common method for search refinement. Query expansion seeks to rewrite the query so as to retrieve a smaller set of results. For example, queries can be supplemented with words similar to the original query.
There are cases where query disambiguation and/or query expansion are insufficient to locate the most relevant search result for a user. It is therefore an object of the present invention to provide a system, method and computer program product in which the above disadvantages are obviated or mitigated and attainment of the desirable attributes is facilitated.
In one aspect, a system for query clarification applied to a set of search results generated by a search query performed on a corpus of data is provided, the system comprising one or more processors configured to execute: a labelling engine to apply to each datum within the corpus a label corresponding to each one of a plurality of predetermined indicator variables, each indicator variable relating to context of the respective data; and a clarification engine to: generate a decision tree using the set of search results, the decision tree comprising nodes corresponding to the indicator variables and edges corresponding to the labels, the decision tree generated to maximize information gain based on pruning the decision tree in response to obtaining a desired label for a selected indicator variable; and prune the decision tree in response to a question posed to a user to obtain a label for an indicator variable.
In a particular case, each indicator variable corresponds to a question and each label of associated edges corresponds to an answer associated with the question.
In another case, each indicator variable represents a category of interest to a particular field represented by the corpus of data.
In yet another case, at least one of the labels is unknown.
In yet another case, each datum in the corpus comprises one or more webpages.
In yet another case, at least a portion of the data are manually labelled and the labelling engine applies inheritance of the labels to webpages associated with the manually labelled data.
In yet another case, the labelling engine uses a trained supervised learning classifier for each of the indicator variables to label the data, the supervised learning classifier trained using a set of manually labelled data for training and testing.
In yet another case, maximizing information gain comprises determining the information gained in knowing a value of each indicator variable, the indicator variable with the largest potential information gain being used to split the search results into subsets according to its value to produce a node in the decision tree, and wherein the question posed to the user results in obtaining a label or value for the indicator variable.
In yet another case, to prune the decision tree, the clarification engine iteratively determines the information gained and poses the question to the user that will provide the largest information gain.
In yet another case, information gain is determined by determining a probability that the answer provided by the user to each question is accurate, and that a desired search result to the search query will be found in the set of documents represented within that answer.
In another aspect, a computer-implemented method for query clarification applied to a set of search results generated by a search query performed on a corpus of data is provided, the method comprising: applying to each datum within the corpus a label corresponding to each one of a plurality of predetermined indicator variables, each indicator variable relating to context of the respective data; generating a decision tree using the set of search results, the decision tree comprising nodes corresponding to the indicator variables and edges corresponding to the labels, the decision tree generated to maximize information gain based on pruning the decision tree in response to obtaining a desired label for a selected indicator variable; and pruning the decision tree in response to a question posed to a user to obtain a label for an indicator variable.
In a particular case, each indicator variable corresponds to a question and each label of associated edges corresponds to an answer associated with the question.
In another case, each indicator variable represents a category of interest to a particular field represented by the corpus of data.
In yet another case, at least one of the labels is unknown.
In yet another case, each datum in the corpus of data comprises one or more webpages.
In yet another case, at least a portion of the data are manually labelled and the labelling engine applies inheritance of the labels to webpages associated with the manually labelled data.
In yet another case, applying the label comprises using a trained supervised learning classifier for each of the indicator variables to label the data, the supervised learning classifier trained using a set of manually labelled data for training and testing.
In yet another case, maximizing information gain comprises determining the information gained in knowing a value of each indicator variable, the indicator variable with the largest potential information gain being used to split the search results into subsets according to its value to produce a node in the decision tree, and wherein the question posed to the user results in obtaining a label or value for the indicator variable.
In yet another case, pruning the decision tree comprises iteratively determining the information gained and posing the question to the user that will provide the largest information gain.
In yet another case, information gain is determined by determining a probability that the answer provided by the user to each question is accurate, and that a desired search result to the search query will be found in the set of documents represented within that answer.
In further aspects, methods and computer programs for implementing the system are provided.
The features of the invention will become more apparent in the following detailed description in which reference is made to the appended drawings wherein:
Embodiments will now be described with reference to the figures. For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the Figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Also, the description is not to be considered as limiting the scope of the embodiments described herein.
Various terms used throughout the present description may be read and understood as follows, unless the context indicates otherwise: “or” as used throughout is inclusive, as though written “and/or”; singular articles and pronouns as used throughout include their plural forms, and vice versa; similarly, gendered pronouns include their counterpart pronouns so that pronouns should not be understood as limiting anything described herein to use, implementation, performance, etc. by a single gender; “exemplary” should be understood as “illustrative” or “exemplifying” and not necessarily as “preferred” over other embodiments. Further definitions for terms may be set out herein; these may apply to prior and subsequent instances of those terms, as will be understood from a reading of the present description.
Any module, unit, component, server, computer, terminal, engine or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the device or accessible or connectable thereto. Further, unless the context clearly indicates otherwise, any processor or controller set out herein may be implemented as a singular processor or as a plurality of processors. The plurality of processors may be arrayed or distributed, and any processing function referred to herein may be carried out by one or by a plurality of processors, even though a single processor may be exemplified. Any method, application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.
The following relates generally to interactive computerized search, and more specifically, to a computer-based system, method and computer program product for query clarification. The method and system described herein utilize categorization of data within the corpus of information to be searched in order to assist the user in obtaining a desired search result. The system comprises a trained model for labelling data within a corpus with indicator variable values, whereby the values are indicative of categories of interest to users conducting searches on the corpus, and a clarification engine for interacting with users to refine search results on the basis of the categories.
Referring first to
In brief, the corpus of data (102) is any grouping of references of interest to a user. This could include, for example, the contents of the world wide web or any particular subset thereof; it could alternatively include proprietary data repositories. An illustrative example described below comprises web pages selected from various sources that are related to dementia diagnosis, treatment and care. The corpus could be generated using a web crawler on a set of websites. For example, the corpus could be generated using Apache™ Nutch™. In a more particular example, the web crawler can be used to generate a seed set of data, preferably obtained from reliable sources, and from this seed set additional data may be added to the corpus in a breadth-first search to a predetermined depth.
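For illustration only, the following is a minimal sketch of such a breadth-first crawl in Python. It is a generic stand-in for a production crawler such as the Apache™ Nutch™ tool named above; the `requests` and `beautifulsoup4` packages, the timeout, and the default depth are all assumptions, not features of the described system.

```python
# Minimal breadth-first crawler sketch (illustrative stand-in for a tool
# like Apache Nutch). Assumes third-party `requests` and `beautifulsoup4`.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def build_corpus(seed_urls, max_depth=2):
    """Crawl outward from a seed set of (preferably reliable) pages,
    breadth-first, down to a predetermined depth."""
    corpus = {}                      # url -> extracted page text
    visited = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                 # skip unreachable pages
        soup = BeautifulSoup(html, "html.parser")
        corpus[url] = soup.get_text(separator=" ", strip=True)
        if depth < max_depth:
            for link in soup.find_all("a", href=True):
                target = urljoin(url, link["href"])
                if target not in visited:
                    visited.add(target)
                    queue.append((target, depth + 1))
    return corpus
```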
For this type of example, the presently disclosed system and method ingest the data in the corpus and apply labels to each data item in the corpus. The labels are associated with a plurality of indicator variables which are related to the context of the data. Exemplary context includes the type of source of the data (whether generated by users or institutions) and the accessibility of the data (whether free or paid). Several other indicator variables are described below for this example. Each of these indicator variables effectively acts as a potential "question", e.g., is the content free or does it require a payment? The labels comprise the various potential "answers" to the indicator variable "questions". When a search is conducted, the search engine returns relevant results and the clarification engine generates a decision tree using these results. The decision tree is generated using nodes that are the indicator variables and edges that correspond to each label, and the decision tree is generated in order to determine the indicator variable that, if posed as a question to which the user responds, provides the most information gain for narrowing the search results. The decision tree can then be regenerated and a new question posed, iteratively, until a desired number of search results remains. A more detailed explanation now follows.
The query clarification method provided herein permits personalization of search results to the user without requiring a user profile. Rather, the query clarification method utilizes a direct question and answer process with the user to permit the user to navigate amongst possible search result categories.
Although question and answer search refinement has been used, for example, in virtual assistants and chatbots, the method and system described herein utilize categorization of data (possible search results) in a corpus that is based on the context, rather than merely the content, of that data. In other words, the clarifying questions are not necessarily selected by determining the content of various data items and querying the user to select the most relevant content. Rather, the presently described system utilizes a predefined (depending on the field represented by the corpus) set of relevant semantic categories by which search results are subdivided, wherein each semantic category is represented as an indicator variable. The filtering (narrowing, selecting, ranking) of search results is conducted based on user responses to clarifying questions that are related to the indicator variables.
In embodiments, the system may begin with an existing ontology (such as the Yahoo™ subject hierarchy, or similar, for webpages) to train a classifier to automatically classify the data (such as webpages, for example) into ontological concepts, which are then used to re-rank or filter results based on the user's profile of concepts. Categories can also be dynamically generated by automatically clustering search results into subsets and automatically generating a proposed query for each, as sketched below.
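As a rough illustration of this dynamic alternative, the following hedged sketch clusters search results with scikit-learn and derives a proposed query from each cluster's highest-weighted terms. The cluster count and term count are arbitrary assumptions, not parameters taken from the present system.

```python
# Sketch of the dynamic-category alternative: cluster the search results
# and derive a proposed query from each cluster's top tf-idf terms.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def propose_queries(result_texts, n_clusters=4, terms_per_query=3):
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(result_texts)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    vocab = vectorizer.get_feature_names_out()
    queries = []
    for center in km.cluster_centers_:
        top = center.argsort()[::-1][:terms_per_query]  # highest-weight terms
        queries.append(" ".join(vocab[i] for i in top))
    return queries
```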
As previously mentioned, the present technique categorizes search result candidates based on a plurality of indicator variables. The set of indicator variables may be selected to reflect categories of interest and relevance to users in a particular field represented by the corpus of information being searched. The indicator variables can be considered as metadata that correspond to a Question:Answer pair, where the questions relate to the indicator variable and the possible answers relate to the labels that could populate each indicator variable. Illustrative examples of such Question:Answer pairs are Audience:Researchers (as opposed to Audience:Patients or Audience:Caregivers, for example) or Payment:Subscription (as opposed to Payment:Free, for example). Further examples are shown below in Table 1, which shows examples of indicator variables and associated labels that might be relevant for querying a corpus of medical diagnostic or treatment data.
The example indicator variables shown in Table 1 are merely illustrative and are not to be considered limiting to the claims. In practice, the set of indicator variables and potential labels could be developed through focus groups of users interested in the corpus to determine the most important kinds of information. The indicator variables shown in Table 1, for example, were developed through focus groups of caregivers of individuals with dementia.
In addition, the system may be configured to apply labels to a subset of indicator variables for any particular data item. In other words, it may not be necessary to apply a label to all indicator variables; some may be left unknown. The decision to leave an indicator variable unknown could be computed, e.g., according to the confidence of the classifier for the given datum.
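Purely as an illustration of the shape of this metadata, the sketch below records per-document labels as a mapping from indicator variables to answers, and leaves a label "unknown" when classifier confidence is low. The variable names echo the Audience and Payment examples above; the URL, the Source variable, and the confidence threshold are hypothetical.

```python
# Illustrative shape of the per-document indicator-variable labels.
# Any label may be left "unknown", e.g. when classifier confidence is low.
CONFIDENCE_THRESHOLD = 0.7   # assumed cut-off, not specified in the source

document_labels = {
    "https://example.org/dementia-care-guide": {   # hypothetical document
        "Audience": "Caregivers",   # Question: who is this written for?
        "Payment": "Free",          # Question: is the content free or paid?
        "Source": "Institution",    # Question: user- or institution-generated?
    },
}

def label_with_confidence(prediction, confidence):
    """Record a classifier's label only when it is confident enough."""
    return prediction if confidence >= CONFIDENCE_THRESHOLD else "unknown"
```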
Referring now to
Given a subset of the data labelled with their respective indicator variable labels, the labelling engine (104) applies supervised learning to obtain a model that automatically labels new documents with inferred indicator variable labels. Depending on the quantity of data available, some number of data items in a first subset of the labelled documents may be used for training, and the remainder in a second subset for testing. A separate classifier is used for each indicator variable. In a particular example, with 1688 labelled data items, two-thirds of the data were used for training and one-third for testing.
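The following is a minimal sketch of this per-variable training regime in Python with scikit-learn, assuming one classifier per indicator variable and the two-thirds/one-third split described above. The tf-idf plus linear SVM pipeline mirrors the best-performing combination reported later in this description; the exact vectorizer and classifier settings are assumptions.

```python
# Sketch of per-indicator-variable supervised labelling: one classifier per
# indicator variable, trained on two-thirds of the labelled data and tested
# on the remaining third.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_labelling_models(texts, labels_by_variable):
    """texts: list of document strings.
    labels_by_variable: {indicator_variable: [label per document]}."""
    models = {}
    for variable, labels in labels_by_variable.items():
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=1 / 3, random_state=0)
        model = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2)),   # unigrams and bigrams
            LinearSVC())
        model.fit(X_train, y_train)
        print(variable, classification_report(y_test, model.predict(X_test)))
        models[variable] = model
    return models
```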
Exemplary classifiers include naïve Bayes (NB), support vector machines (SVMs, with a linear kernel K_lin(x_i, x_j) = x_i^T x_j), artificial neural networks (ANNs) and fastText, which is an efficient model for sentence classification that first learns embedded representations of documents, which are then fed into a multinomial logistic classifier. An exemplary ANN classification architecture is shown in
The NB and SVM classifiers make use of lexical features that encode the probability of various unigrams and bigrams occurring in documents of a specific class. In embodiments, the labelling engine (104) uses tf-idf (term frequency-inverse document frequency) weighted unigram and bigram counts to determine the importance of unigrams and bigrams to the document under classification. The following six feature types are illustrative examples, wherein the first two are 'bag of n-gram' approaches, the next three are distributed representations, and the final one is a combination of several lexical-syntactic features.
Individual token frequency: Each document is represented by a vector where each element contains the number of times the unigram, or bigram, occurs in the document.
Token frequency with tf-idf weighting: By using the product of tf and idf, words that are present in most documents (e.g., "the", "a", "is") may be penalized.
Mean of global vectors (GloVe): GloVe vectors are distributed representations of words obtained by training on aggregated global word-word co-occurrence statistics from a corpus. Here, features may be generated by averaging the GloVe vectors corresponding to each word occurring in the document (a brief sketch of this feature type follows this list).
Mean of word2vec vectors: Word2vec is an alternative model for learning word embeddings from raw text. An exemplary model may be trained on an existing dataset, such as the Google News dataset.
Document (doc2vec) vectors: Doc2vec is an extension of word2vec that learns fixed-length vector representations of documents by predicting document contents.
Lexical-syntactic features: These are obtained from syntactic parses and part-of-speech tagging of the corpus. These features may include:
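As a concrete illustration of the two mean-of-embeddings feature types above (GloVe and word2vec), the following sketch averages pretrained word vectors over a document. It assumes `embeddings` is a word-to-vector mapping loaded from pretrained GloVe or word2vec files, and the 300-dimension default is an assumption.

```python
# Sketch of the mean-of-embeddings feature types (GloVe or word2vec): a
# document vector is the average of the pretrained vectors of its words.
import numpy as np

def mean_embedding_features(document, embeddings, dim=300):
    """embeddings: {word: numpy vector}; out-of-vocabulary words are skipped."""
    vectors = [embeddings[w] for w in document.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)          # no known words: fall back to zeros
    return np.mean(vectors, axis=0)   # one fixed-length vector per document
```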
These feature and classifier types were evaluated in a validation of the presently described system. Exemplary results are shown below for illustrative assistance only, and are not intended as promises of particular performance in any specific implementation. Precision, recall, and F1 scores for each possible pair of feature and classifier type, averaged across the Indicator Variables, are shown in Table 2. In cases where indicator variables can take more than two values, the provided scores are the average for each value.
In these examples, using a corpus as described below, all combinations of feature-type and classifier-type yield high average scores. However, a combination of tf-idf weighted vectors and a linear SVM yields the best performance, with precision of 0.96 and recall of 0.89, and with relatively little variation across indicator variables. The tf-idf weighted vectors outperform plain token-frequency vectors because they factor in the frequency of tokens in the corpus. This helps prevent the model from placing undue emphasis on less informative words that occur frequently in the corpus.
The performance of the ANN may be improved significantly by regularization. Possible over-fitting to the training data may be prevented by halting training once validation accuracy starts decreasing. Dropout layers may also be introduced to bias the network towards simpler models of the training data.
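By way of illustration only, the following Keras sketch applies both measures: dropout layers and early stopping on validation accuracy. The layer sizes, dropout rate, and patience value are assumptions, since the actual architecture appears only in the referenced figure.

```python
# Sketch of the regularization measures described above: dropout layers
# plus early stopping once validation accuracy stops improving.
import tensorflow as tf

def build_ann(input_dim, n_classes):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),   # bias toward simpler models
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Halt training once validation accuracy starts decreasing.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=3, restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, callbacks=[early_stop])
```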
The relatively poor performance of the distributed vector representations, word2vec and GloVe, may be due to the composition of a single vector for each document by averaging the vector representations of its constituent words. Order-sensitive representations, such as sequence models and tree-structured LSTMs, have fared better at generating semantic sentence representations that account for differences in meaning arising from differences in word order or syntactic structure.
The generated indicator variable values are stored in the labelled index database (106), which is accessible to the clarification engine.
Moving now to query clarification on the basis of the indicator variables, the clarification engine (110) applies query clarification to refine search results. The clarification engine (110) automatically selects distinguishing questions for posing directly to the user, for example via an included user interface, wherein the possible answers to the distinguishing questions are related to the indicator variables.
In an embodiment, the query clarification technique comprises a modified ID3 technique to automatically construct an optimum decision tree, given a corpus of data and the labels associated with that data, in order to narrow down the search as quickly as possible. Each clarifying question reduces the number of search results, and each question corresponds to a branch in the decision tree. In a preferred embodiment, the question to be posed to the user is selected by greedily searching for the question that provides the most information gain at each branch, given training data. The search terminates when a sufficiently small number of relevant results remains.
Referring now to
At block 404, the search engine conducts the search against the corpus and returns the initial ranked results. In an example embodiment, the search engine (108) applies Apache™ Solr™, an open-source, full-text search engine. Search queries are conducted against the Solr instance, and the set of search results consists of the top n relevant documents, ranked according to their Okapi BM25 scores. Given a query Q (containing keywords q_1, . . . , q_n), the BM25 score of a document D is expressed as:

score(D, Q) = Σ_{i=1}^{n} IDF(q_i) · f(q_i, D) · (k_1 + 1) / (f(q_i, D) + k_1 · (1 − b + b · |D|/avgdl)) (1)

where: f(q_i, D) is the term frequency of q_i in the document D; IDF(q_i) is the inverse document frequency weight of the query term q_i; |D| is the length of document D in words; avgdl is the average document length in the text collection; and k_1 and b are free parameters either set to defaults or chosen through optimization.
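For illustration, a standalone scorer implementing Equation 1 follows. In the described system this ranking is supplied by Solr itself; the sketch below uses the common Okapi IDF formulation, which is an assumption, as the exact IDF variant is not specified here.

```python
# Minimal standalone BM25 scorer implementing Equation 1 above; in the
# described system this ranking is supplied by Apache Solr.
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl,
               k1=1.5, b=0.75):
    """doc_terms: tokenized document; doc_freqs: {term: number of documents
    in the collection containing the term}; n_docs: collection size."""
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)                 # term frequency f(q, D)
        n = doc_freqs.get(q, 0)
        idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)   # IDF(q), Okapi form
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
```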
The clarification engine (110) utilizes the indicator variables for the search results (that is, the top n relevant documents) obtained from the search engine (108). To determine which indicator variable to utilize in a clarification question, at block 406, the clarification engine (110) first generates decision trees from among the search results.
In an example embodiment, the clarification engine (110) implements a modified ID3 algorithm to generate decision trees from the search results. Given the set of search results C, the clarification engine calculates the information gained in knowing the value of each attribute A_k, where A is the set of indicator variables and k indexes each indicator variable. The indicator variable with the largest potential information gain is then used to split C into subsets according to its value, producing a node in the decision tree. At block 408, the clarification engine (110) directs the user interface (112) to pose a question to the user to provide a label for the indicator variable that will provide the largest information gain, and gathers the user response accordingly at block 410.
In one embodiment, the clarification engine then iterates recursively on the resulting subsets of C. However, it may be preferred that the clarification engine not generate a single decision tree a priori for all possible queries, but instead produce a decision tree dynamically according to the user's preferences and initial input. In this embodiment, the clarification engine configures a question to be posed by the user interface to the user, such that the user specifies values a_i for attributes A_k after each iteration. In this embodiment, the ID3 algorithm is modified so that, at block 412, all irrelevant branches are pruned after each iteration based on information from the user. Thus, the clarification engine can be configured to pose new questions to the user at each utilized (unpruned) branch along the decision tree, wherein the question to present to the user is preferably the one with the highest expected information gain.
The clarification engine determines information gain by use of the probability that the user's answer to each question is accurate, and that the desired search result will be found in the set of documents represented within that answer. Search is directed toward determining which documents in a working set, D, are relevant to the user's needs. The uncertainty about the possible relevance of each document can be expressed as a function of a discrete probability distribution P(D), as shown in Equation 2:

H(D) = −P(D) log2 P(D) (2)

where P(D) is the likelihood that the user will find a desired document in the working set. For simplicity, it may be assumed that the user is only interested in a specific document in D, and initially P(D) = 1/|D|, but this assumption is open to change.
The uncertainty remaining after the user reveals an answer a to the question A_k can similarly be expressed as a function of the conditional probability distribution P(D|A_k=a), as shown in Equation 3:

H(D|A_k=a) = −P(D|A_k=a) log2 P(D|A_k=a) (3)
Once again, for the sake of simplicity, it may be assumed that the user's answer is accurate and that the one relevant document is present in the set of documents whose attribute A_k = a. This permits use of a P(D|A_k=a) as expressed in Equation 4:

P(D|A_k=a) = 1/|D_{A_k=a}| (4)

where D_{A_k=a} denotes the subset of D whose attribute A_k takes the value a.
The full information gain of obtaining any answer to a question corresponding to indicator variable A_k is the difference between the initial uncertainty H(D) and the average uncertainty after revealing some value a of A_k. The latter is found by taking into account P(A_k=a), the probability that the user would give the answer a. In some embodiments, personalization, including historical data such as past user interactions, may be used to estimate this value. However, since it may be the case that personalization is non-existent or insufficient, the clarification engine (110) may estimate it using the fraction of documents in the index corresponding to the provided answer, i.e.,

P(A_k=a) = |D_{A_k=a}| / |D| (5)
The final expression for information gain is provided in Equation 6 and can be expressed as a function of document counts by substituting in Equations 2, 3, 4, and 5:

IG(A_k) = H(D) − Σ_a P(A_k=a) · H(D|A_k=a) (6)
The clarification engine (110) correspondingly selects the indicator variable A_k that results in the maximal IG, and prompts the user to provide an answer a to the clarifying question corresponding to that variable. The clarification engine then updates D such that D = D_{A_k=a}, and the process repeats until a sufficiently small set of relevant results remains.
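A hedged sketch of this loop follows. Under the uniform-relevance assumptions above, H(D) reduces to log2 |D|, so Equation 6 can be computed from document counts alone. The label-dict document representation, the `ask_user` callback, and the stopping size are all assumptions for illustration, not features of the claimed system.

```python
# Sketch of the clarification loop based on Equations 2-6: with uniform
# P(D) = 1/|D|, uncertainty reduces to log2 of the working-set size, so
# information gain is a function of document counts alone.
import math
from collections import Counter

def information_gain(documents, variable):
    """IG of asking about one indicator variable (Equation 6).
    documents: list of {indicator_variable: label} dicts."""
    counts = Counter(doc[variable] for doc in documents)
    h_before = math.log2(len(documents))               # H(D), Equation 2
    h_after = sum((n / len(documents)) * math.log2(n)  # P(A=a) * H(D|A=a)
                  for n in counts.values())
    return h_before - h_after

def clarify(documents, variables, ask_user, stop_at=5):
    """Iteratively pose the max-gain question and prune the working set."""
    remaining = list(variables)
    while len(documents) > stop_at and remaining:
        best = max(remaining, key=lambda v: information_gain(documents, v))
        answer = ask_user(best)                        # pose the question
        documents = [d for d in documents if d[best] == answer]  # prune D
        remaining.remove(best)
    return documents
```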
In alternative embodiments as shown in
In
It will be appreciated that the techniques described herein need not be practiced in isolation. For example, the search engine (108) could incorporate third party search techniques, such as query expansion, personalization, ranking techniques, etc. without detracting from the presently described system and method.
For example, if a search engine uses query expansion to recommend alternate queries, it would be prudent to give prominence to rewritten queries that will benefit the most from query clarification. Equations 7 to 12, below, provide an example of how to select an optimal (rewritten query, clarifying question) pair.
The information gain (IG) offered by each query-question pair is calculated by considering how their respective values would partition the set of documents C_{Q=q}, where C_{Q=q} ⊆ C, and Q is a variable that indicates whether or not a document in the collection C would be in the search results returned after searching for query q.
Although the invention has been described with reference to certain specific embodiments, various transformations thereof will be apparent to those skilled in the art. The scope of the claims should not be limited by the preferred embodiments, but should be given the broadest interpretation consistent with the description as a whole.
Number | Date | Country
--- | --- | ---
62693888 | Jul 2018 | US