Many search engine services, such as Google and Overture, provide for searching for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request (also referred to as a “query”) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of base web pages to identify all web pages that are accessible through those base web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service may generate a relevance score to indicate how related the information of the web page may be to the search request. The search engine service then displays to the user links to those web pages in an order that is based on their relevance.
Discussion threads are a popular way for people to communicate using the Internet. A particularly popular type of discussion thread service is a web forum. A web forum is a web site that allows users of the web site to post information that is available to be viewed by other users of the web site. A discussion thread, such as a newsgroup, allows people to participate in a discussion about a specific topic. A discussion thread is typically initiated when a person creates an initial message directed to a topic and posts the message as a new discussion thread. Other persons can read the initial message and post response messages to the discussion thread. For example, the initial message may pose a question such as “Has anyone encountered a situation where the Acme software product aborts with error number 456?” Persons who want to participate in the discussion can post response messages such as “It happens to me all the time” or “I fixed the problem by reinstalling the software.” Discussion threads typically take the form of a tree structure as sequences of messages branch off into different paths. For example, three different persons can post a response message to the initial message, starting three branches, and other persons can post response messages to any one of those response messages to extend those branches.
Discussion threads may include questions and their answers. For example, a customer support group within a company that sells a certain software product may provide a mechanism for its customers to create and participate in discussion threads relating to the software product. For example, a customer may initiate a discussion thread by posting an initial message that poses a question such as the one mentioned above. That question may be answered by the posting of a response message by another customer or a customer service representative. The corpus of discussion threads of the company may provide a vast amount of knowledge related to problems and concerns that customers may encounter along with appropriate responses (e.g., answers to questions posed).
When a customer wants an answer to a question, the customer may either initiate a new discussion thread or search messages of existing discussion threads that may provide an answer to the customer's question. When searching for an answer within the messages of a corpus of discussion threads, a customer may submit a short query using keywords of the question. For example, the customer may submit the query “error 456” in hopes of finding an answer to the question mentioned above. A search engine may be used to identify those messages that contain keywords matching the query. In many instances, the messages that best match the keywords of the query are the messages that pose a similar question. The response messages may not result in a good keyword match in part because they may not repeat the keywords of the question. The most relevant message to the customer, however, may be a response message that answers the question, rather than a message that poses a similar question.
A method and system for identifying alternate queries for an initial query submitted by a user to a search system is provided. The search system receives an initial query from a user. Upon receiving the initial query, the search system identifies questions that are related to the initial query. The search system may identify the related questions from a question store that contains questions identified from discussion threads or from queries submitted by users of a search engine. The search system may then rank the related questions based on their similarity to the initial query using a variety of metrics, such as a cosine similarity metric or an edit distance metric. The search system then presents the ranked questions to the user as suggested alternate queries to the initial query. When a user selects one of the alternate queries as a final query, the search system searches for documents (e.g., messages of a discussion thread or a web page) that match the query. The search system then presents the matching documents to the user as the search results.
The search system may identify messages within a discussion thread that include answers. The search system identifies candidate messages of a discussion thread that are submitted by a person other than the initiator of the discussion thread. The search system then removes as candidate messages those messages that can be identified as very likely not to be an answer. The search system may train a support vector machine to classify messages as not being answers, using training data of sample messages along with an indication of whether each message is an answer. The search system then ranks the remaining candidate messages based on a likelihood of the message being the best answer. The search system may use the determination that a message is an answer when providing search results to a query.
The search system may also identify an expert relating to the subject of a query. The search system may identify experts from among persons who have given answers in a discussion thread. The search system then creates an expert profile for each expert. An expert profile is a collection of keywords (e.g., from questions that the expert answers) that relate to the discussion threads in which the expert participated. When a user wants to identify an expert, the user submits a query to the search system. The search system may use conventional search techniques to identify expert profiles that best match the query. The search system then presents the experts associated with the best matching expert profiles to the user.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
A method and system for identifying alternate queries for an initial query submitted by a user to a search system is provided. In one embodiment, the search system receives an initial query from a user. For example, the query may be submitted by a user searching for an answer to a question within a web forum or searching for web pages that are related to the query. Upon receiving the initial query, the search system identifies questions that are related to the initial query. The search system may identify the related questions from a question store that contains questions identified from discussion threads or from queries submitted by users of a search engine. The search system may use various techniques for identifying whether a question is related to or matches a query. For example, the search system may use a cosine similarity metric, an edit distance metric, a language model-based metric, and so on. The search system may then rank the related questions based on their similarity to the initial query using a variety of metrics such as those described above. The search system then presents the ranked questions to the user as suggested alternate queries to the initial query. When a user selects one of the alternate queries as a final query, the search system searches for documents (e.g., messages of a discussion thread or a web page) that match the query. The search system then presents the matching documents to the user as the search results. In this way, the search system suggests alternate queries to a user that are derived from the corpus of documents of interest and thus may be more likely to identify documents of interest to the user.
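The ranking of related questions may be illustrated with a minimal sketch that uses only the cosine similarity metric over bag-of-words term counts. The function names, the contents of the question store, and the `top_k` cutoff are hypothetical and chosen for illustration; the edit distance and language model-based metrics mentioned above are not shown.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words term-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def suggest_alternate_queries(query, question_store, top_k=3):
    """Rank stored questions by similarity to the query, dropping non-matches."""
    scored = [(cosine_similarity(query, q), q) for q in question_store]
    scored = [(s, q) for s, q in scored if s > 0.0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in scored[:top_k]]


# Hypothetical question store for illustration only.
question_store = [
    "how do i fix error 456 in acme",
    "why does acme abort with error 456 on startup",
    "how do i change the display icons",
]
suggestions = suggest_alternate_queries("error 456", question_store)
```

In this sketch the shorter matching question ranks first because cosine similarity normalizes by vector length; a deployed system might combine several metrics rather than rely on one.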
In one embodiment, the search system may use expanded queries of the initial query to search for related questions. Upon receiving an initial query, the search system may expand any acronyms of the initial query. Then, for each word of the initial query, the search system may correct any misspelling of the word, perform stemming on the word, and identify synonyms of that word. For example, if the query is “displaing icons,” the search system may correct the spelling to “displaying,” stem the word to “display,” and identify the synonyms of “render,” “output,” and “show.” The search system may then generate the expanded queries of “render icon,” “output icon,” and “show icon.” The search system searches for questions that are related to each expanded query and considers the related questions to also be related to the initial query. The search system then ranks and presents the related questions to the user as alternate queries as described above. In an alternate embodiment, the search system may perform the searching for and presenting of alternate queries as the user types a query. For example, each time a user enters a new letter or completes a word of the initial query, the search system may update a list of alternate queries that is presented to the user. At any time, the user may select an alternate query from the list as the final query.
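The expansion steps described above (spelling correction, stemming, and synonym substitution) may be sketched as follows. This is an illustration under stated assumptions: the vocabulary, synonym table, and suffix list are hypothetical toy resources, spelling correction is approximated with the standard library's `difflib.get_close_matches`, and the suffix-stripping stemmer is a deliberately crude stand-in for a real stemming algorithm. Acronym expansion is omitted.

```python
import difflib
from itertools import product

# Hypothetical toy resources; a real system would use much larger ones.
VOCABULARY = ["displaying", "icons", "rendering", "windows"]
SYNONYMS = {"display": ["render", "output", "show"]}
SUFFIXES = ("ing", "s")


def correct_spelling(word):
    """Replace a word with its closest vocabulary entry, if one is close enough."""
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else word


def stem(word):
    """Crude suffix stripping that keeps at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(query):
    """Build one expanded query per combination of synonyms of the stemmed words."""
    alternatives = []
    for word in query.lower().split():
        base = stem(correct_spelling(word))
        # Words without known synonyms are kept as their stem.
        alternatives.append(SYNONYMS.get(base, [base]))
    return [" ".join(words) for words in product(*alternatives)]


expanded = expand_query("displaing icons")
```

With these toy tables, the sketch reproduces the expanded queries of the example above; each expanded query would then be used to search the question store.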
In one embodiment, the search system treats the identification of alternate queries as a “translation” from one language to another language. The search system thus may employ a source-channel model used in such translations to identify questions that are related to a query (e.g., initial query or expanded query). In particular, the search system may select the question y* according to the following equation:

$$y^{*} = \arg\max_{y} P(y \mid x) = \arg\max_{y} P(y)\,P(x \mid y)$$

where P(y|x) is the conditional probability of question y given query x, P(y) is the probability of question y, and P(x|y) is the conditional probability of query x given question y. P(y) corresponds to a language model, and P(x|y) corresponds to a translation model. The search system creates a language model using a corpus of documents and estimates the bigram probabilities (e.g., P(mining|data) and P(book|mining)). The search system may use various smoothing techniques to account for sparse data in the corpus. The search system builds a translation model using merging (e.g., “book store” to “bookstore”), splitting (e.g., “Patentand Trademark Office” to “Patent and Trademark Office”), switching (e.g., “mining data” to “data mining”), and replacing (e.g., “data mine” to “data mining”). The search system may assume that all the translation probabilities are 1. The search system may use various dynamic programming techniques (e.g., a Viterbi algorithm) to identify the most probable translation path within a lattice of paths.
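The source-channel selection described above may be sketched with a bigram language model and a small set of rewrite operations. The following is an illustration only: the corpus is a hypothetical toy corpus, add-k smoothing with k = 0.1 is one assumed choice among the possible smoothing techniques, only single-step switching, merging, and splitting are generated (replacing is omitted), and the candidates are enumerated exhaustively instead of being searched with a Viterbi algorithm over a lattice. Because the translation probabilities are assumed to be 1, the language model score alone selects the rewrite.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the corpus of documents.
CORPUS = [
    "data mining book",
    "data mining tutorial",
    "book store hours",
    "data mining book store",
]
K = 0.1  # assumed add-k smoothing constant

unigrams = Counter()
bigrams = Counter()
for sentence in CORPUS:
    tokens = ["<s>"] + sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
VOCAB_SIZE = len(unigrams)


def log_prob(words):
    """Bigram language model log P(y) with add-k smoothing."""
    tokens = ["<s>"] + list(words)
    return sum(
        math.log((bigrams[(a, b)] + K) / (unigrams[a] + K * VOCAB_SIZE))
        for a, b in zip(tokens, tokens[1:])
    )


def candidates(words):
    """The query itself plus single-step switch, merge, and split rewrites."""
    words = list(words)
    out = [words]
    for i in range(len(words) - 1):
        # Switch two adjacent words.
        out.append(words[:i] + [words[i + 1], words[i]] + words[i + 2:])
        # Merge two adjacent words when the result is a known word.
        merged = words[i] + words[i + 1]
        if merged in unigrams:
            out.append(words[:i] + [merged] + words[i + 2:])
    for i, w in enumerate(words):
        for cut in range(1, len(w)):
            left, right = w[:cut], w[cut:]
            # Split one word into two known words.
            if left in unigrams and right in unigrams:
                out.append(words[:i] + [left, right] + words[i + 1:])
    return out


def best_rewrite(query):
    # Translation probabilities are taken to be 1, so P(y) decides alone.
    return " ".join(max(candidates(query.split()), key=log_prob))
```

Comparing unnormalized probabilities of candidates with different numbers of words favors shorter candidates; the sketch accepts that bias for simplicity, and a fuller implementation would need to account for it.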
In one embodiment, the search system identifies messages within a discussion thread that include answers. The search system identifies candidate messages of a discussion thread that are submitted by a person other than the initiator of the discussion thread. For example, the initiator of a discussion thread may pose a question that is the subject of the discussion thread. The search system then removes as candidate messages those messages that can be identified as very likely not to be an answer. The search system may train a support vector machine to classify messages as not being answers, using training data of sample messages along with an indication of whether each message is an answer. Alternatively, the search system may use language patterns (e.g., “I have a similar problem”) to identify messages that are very likely not an answer. The search system may also use a Conditional Random Fields (“CRF”) model to label the messages. A CRF model is an undirected graphical model trained to maximize a conditional probability, as described in Lafferty, J., McCallum, A., and Pereira, F., “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” ICML 2001. The search system then ranks the remaining candidate messages based on a likelihood of the message being an answer. For example, the search system may train a ranking support vector machine to rank the answer messages. The search system may represent candidate messages for ranking purposes using various features. These features may include length of the candidate message, the total number of messages in the web forum submitted by the submitter of the candidate message, the number of messages of the discussion thread submitted by the submitter of the candidate message, the order in the discussion thread of all messages submitted by the submitter of the candidate message, content heuristics (e.g., number of external links and presence of answer-indicating words or phrases), and so on. The search system may use the determination that a message is an answer when providing search results to a query. For example, the search system may include in the search results only messages that are highly ranked as answers or may rank such messages higher within the search results.
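The candidate filtering and feature representation described above may be sketched as follows. The `Message` type, the pattern list, the feature names, and the per-author forum post counts are hypothetical and introduced only for illustration; the sketch uses the language-pattern alternative to remove non-answers and does not train a support vector machine or a CRF model.

```python
import re
from dataclasses import dataclass


@dataclass
class Message:
    author: str
    text: str


# Hypothetical language patterns that suggest a message is not an answer.
NON_ANSWER_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bI have (a|the) (similar|same) problem\b", r"\bme too\b")
]


def candidate_positions(thread):
    """Positions of messages not written by the initiator and not matching a pattern."""
    initiator = thread[0].author
    return [
        pos
        for pos, msg in enumerate(thread)
        if msg.author != initiator
        and not any(p.search(msg.text) for p in NON_ANSWER_PATTERNS)
    ]


def extract_features(thread, pos, forum_post_counts):
    """Feature dictionary for the candidate message at the given position."""
    msg = thread[pos]
    return {
        "length_in_words": len(msg.text.split()),
        "forum_posts_by_author": forum_post_counts.get(msg.author, 0),
        "thread_posts_by_author": sum(m.author == msg.author for m in thread),
        "author_positions_in_thread": [
            i for i, m in enumerate(thread) if m.author == msg.author
        ],
        "external_links": len(re.findall(r"https?://", msg.text)),
    }


thread = [
    Message("alice", "Has anyone seen error 456?"),
    Message("bob", "I have a similar problem."),
    Message("carol", "I fixed the problem by reinstalling the software, see http://example.com/fix"),
    Message("alice", "Thanks, that worked."),
]
positions = candidate_positions(thread)
features = extract_features(thread, positions[0], {"carol": 42})
```

The resulting feature dictionaries could then be converted to numeric vectors and passed to a trained ranking function.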
The search system may use various techniques to train the classifier to classify messages as answers or not answers. The classifier may be trained to generate discrete values (e.g., 1 or 0) indicating whether or not a message is an answer or continuous values (e.g., between 0 and 1) indicating the likelihood that a message is an answer. The search system may use support vector machine techniques to train the classifier. A support vector machine operates by finding a hyper-surface in the space of possible inputs. The hyper-surface attempts to split the positive examples (e.g., features of answer messages) from the negative examples (e.g., features of non-answer messages) by maximizing the distance between the nearest of the positive and negative examples to the hyper-surface. This allows for correct classification of data that is similar to but not identical to the training data. Various techniques can be used to train a support vector machine. One technique uses a sequential minimal optimization algorithm that breaks the large quadratic programming problem down into a series of small quadratic programming problems that can be solved analytically. (See Sequential Minimal Optimization, at http://research.microsoft.com/~jplatt/smo.html.)
The search system may use a RankSVM algorithm to rank the candidate messages as answers. A RankSVM algorithm, which is a variation of a generalized support vector machine (SVM), attempts to learn a ranking function that preserves the pairwise partial ordering of the candidate messages of training data. A RankSVM algorithm is an ordinal regression technique to minimize the number of incorrectly ranked pairs. A RankSVM algorithm is described in Joachims, T., “Optimizing Search Engines Using Clickthrough Data,” Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (“KDD”), ACM, 2002. Another example of a technique for learning a ranking function is a RankBoost algorithm. A RankBoost algorithm is an adaptive boosting algorithm that, like a RankSVM algorithm, operates to preserve the ordering of pairs of candidate messages. A RankBoost algorithm attempts to directly solve a preference learning problem. A RankBoost algorithm is described in Freund, Y., Iyer, R., Schapire, R., and Singer, Y., “An Efficient Boosting Algorithm for Combining Preferences,” Journal of Machine Learning Research, 4, 2003. As another example, a neural network algorithm, referred to as RankNet, may be used to rank candidate messages. A RankNet algorithm also operates to preserve the ordering of pairs of candidate messages and models the ordinal relationship between two documents using a probability. A RankNet algorithm is described in Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G., “Learning to Rank Using Gradient Descent,” 22nd International Conference on Machine Learning, Bonn, Germany, 2005.
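The idea of preserving the pairwise ordering of candidate messages may be illustrated with a simplified pairwise learner. This sketch is not the RankSVM algorithm itself: it omits the regularization term and the quadratic programming solver, and instead applies a plain margin-based update to each pair whose score difference falls below 1. The two-dimensional feature vectors, the relevance labels, the learning rate, and the epoch count are hypothetical.

```python
from itertools import combinations


def preference_pairs(examples):
    """Feature differences x_hi - x_lo for every pair with different labels."""
    pairs = []
    for (xa, ra), (xb, rb) in combinations(examples, 2):
        if ra == rb:
            continue
        hi, lo = (xa, xb) if ra > rb else (xb, xa)
        pairs.append([h - l for h, l in zip(hi, lo)])
    return pairs


def score(w, x):
    """Linear ranking function."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(examples, epochs=50, lr=0.5):
    """Learn weights so that better-labeled messages score higher by a margin."""
    diffs = preference_pairs(examples)
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for d in diffs:
            if score(w, d) < 1.0:  # margin violated for this pair
                w = [wi + lr * di for wi, di in zip(w, d)]
    return w


def misordered_pairs(w, examples):
    """Number of training pairs that the learned function ranks incorrectly."""
    return sum(1 for d in preference_pairs(examples) if score(w, d) <= 0)


# Hypothetical (feature vector, answer quality label) training examples.
examples = [
    ([3.0, 1.0], 2),
    ([2.0, 0.0], 1),
    ([1.0, 1.0], 0),
]
weights = train_pairwise(examples)
ranked = sorted(examples, key=lambda e: score(weights, e[0]), reverse=True)
```

On this toy data the learned function orders the three messages consistently with their labels; on data that is not linearly separable the loop simply stops after the given number of epochs.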
In one embodiment, the search system identifies an expert relating to the subject of a query. The search system may identify experts from among persons who have participated in a discussion thread. For example, the search system may identify an expert as any person who has submitted more than a certain number of messages within a web forum. Alternatively, the search system may use rating information on messages provided by participants of a discussion thread to identify an expert. The search system then creates an expert profile for each expert. An expert profile is a collection of keywords that relate to the discussion threads in which the expert participated. For example, the search system may identify keywords from all the discussion threads in which an expert participated and add all the identified keywords to the expert profile of that expert. The search system may create an index of keywords to expert profiles to facilitate searching for expert profiles. When a user wants to identify an expert, the user submits a query to the search system. The search system may use conventional search techniques (e.g., a term frequency by inverse document frequency metric) to identify expert profiles that best match the query. The search system may also preprocess the query by removing non-keywords from the query to facilitate the searching. The search system then presents the experts associated with the best matching expert profiles to the user.
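The construction and searching of expert profiles may be sketched as follows, using a term frequency by inverse document frequency score in which each expert profile is treated as a document. The stop-word list, the thread data, the participant names, and the function names are hypothetical; the sketch scans all profiles instead of building a keyword index, and it treats every participant as an expert without applying a message-count or rating threshold.

```python
import math
from collections import Counter, defaultdict

# Hypothetical non-keywords removed from threads and queries.
STOPWORDS = {"a", "an", "the", "how", "do", "i", "to", "in", "with", "of",
             "who", "knows", "about"}


def keywords(text):
    """Lower-cased tokens with non-keywords removed."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def build_profiles(threads):
    """threads: iterable of (thread_text, participants). Returns expert -> keyword counts."""
    profiles = defaultdict(Counter)
    for text, participants in threads:
        for person in participants:
            profiles[person].update(keywords(text))
    return profiles


def find_experts(query, profiles, top_k=2):
    """Rank experts by the TF-IDF score of the query keywords in their profiles."""
    n = len(profiles)
    # Document frequency: number of profiles that contain each keyword.
    df = Counter(term for profile in profiles.values() for term in profile)
    scores = {}
    for expert, profile in profiles.items():
        s = sum(
            profile[t] * math.log(n / df[t])
            for t in keywords(query)
            if t in profile
        )
        if s > 0:
            scores[expert] = s
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


threads = [
    ("acme error 456 install", ["bob", "carol"]),
    ("acme printer driver crash", ["carol"]),
    ("display icons render", ["dave"]),
]
profiles = build_profiles(threads)
```

For example, `find_experts("who knows about printer driver", profiles)` considers only the keywords "printer" and "driver" after preprocessing the query.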
The computing device on which the search system is implemented may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives). The memory and storage devices are computer-readable media that may be encoded with computer-executable instructions that implement the search system; that is, they are computer-readable media that contain the instructions. In addition, the instructions, data structures, and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communication link. Various communication links may be used, such as the Internet, a local area network, a wide area network, a point-to-point dial-up connection, a cell phone network, and so on.
Embodiments of the search system may be implemented in various operating environments that include personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, digital cameras, network PCs, minicomputers, mainframe computers, cell phones, personal digital assistants, smart phones, distributed computing environments that include any of the above systems or devices, and so on.
The search system may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments. For example, separate computing systems may crawl the web forums, identify answers, identify experts, suggest alternate queries, and train the various classifiers.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. Accordingly, the invention is not limited except as by the appended claims.