Online search engines provide a powerful means for users to locate content on the web. Perhaps because search engines are software programs, they have evolved to more efficiently process queries entered in a formal syntax, such as a Boolean query, that mirrors the formality of a programming language. However, many users may prefer to enter queries in a natural language form, similar to how they might normally communicate in everyday life. For example, a user searching the web to learn the capital city of Bulgaria may prefer to enter “What is the capital of Bulgaria?” instead of “capital AND Bulgaria.” Because many search engines have been optimized to accept user queries in the form of a formal query, they may be less able to efficiently and accurately respond to natural language queries.
Previous solutions tend to rely on a curated knowledge base of data to answer natural language queries. This approach is exemplified by the Watson question answering computing system created by IBM®, which famously appeared on and won the Jeopardy!® game show in the United States. Because Watson and similar solutions rely on a knowledge base, the range of questions they can answer may be limited to the scope of the curated data in the knowledge base. Further, such a knowledge base may be expensive and time-consuming to update with new data.
Techniques are described for answering a natural language question entered by a user as a search query, using machine learning-based methods to gather and analyze evidence from web searches. In some examples, on receiving a natural language question entered by a user, an analysis is performed to determine a question type, answer type, and/or lexical answer type (LAT) for the question. This analysis may employ a rules-based heuristic and/or a classifier trained offline using machine learning. One or more query units may also be extracted from the natural language question using chunking, sentence boundary detection, sentence pattern detection, parsing, named entity detection, part-of-speech tagging, tokenization, or other tools.
In some implementations, the extracted query units, answer type, question type, and/or LAT may then be applied to one or more query generation templates to generate a plurality of queries to be used to gather evidence to determine the answer to the natural language question. The queries may then be ranked using a ranker that is trained offline using machine learning, and the top N ranked queries may be sent to a search engine. Results (e.g., addresses and/or snippets of web documents) may then be filtered and/or ranked using another machine learning trained ranker, and candidate answers are extracted from the results based on the answer type and/or LAT. Candidate answers may be ranked using a ranker that is trained offline using machine learning, and the top answers may be provided to the user. A confidence level may also be determined for the candidate answers, and a top answer may be provided if its confidence level exceeds a threshold confidence.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same reference numbers in different figures indicate similar or identical items.
Embodiments described herein provide techniques for answering a natural language question entered by a user as a search query. In some embodiments, a natural language question is received (e.g., by a search engine) as a search query from a user looking for an answer to the question. As described herein, a natural language question includes a sequence of characters that at least in part may employ a grammar and/or syntax that characterizes normal, everyday speech. For example, a user may ask the question “What is the capital of Bulgaria?” or “When was the Magna Carta signed?” Although some examples given herein describe a natural language question that includes particular question forms (e.g., who, what, where, when, why, how, etc.), embodiments are not so limited and may support natural language questions in any form.
To identify at least one answer to the natural language question, embodiments employ four phases: Question Understanding, Query Formulation, Evidence Gathering, and Answer Extraction/Ranking. Each of the four phases is described in further detail with reference to
In some embodiments, Question Understanding includes analysis of the natural language question to predict a question type and an answer type. Question type may include a factoid type (e.g., “What is the capital of Bulgaria?”), a definition type (e.g., “What does ‘ambidextrous’ mean?”), a puzzle type (e.g., “What words can I spell with the letters BYONGEO”), a mathematics type (e.g., “What are the lowest ten happy numbers?”), or any other type of question. Answer types may include a person, a location, a time/date, a quantity, an event, an organism (e.g., animal, plant, etc.), an object, a concept, or any other answer type. In some embodiments, a lexical answer type (LAT) may also be predicted. The LAT may be more specific and/or may be a subset of the answer type. For example, a question with answer type “person” may have a LAT of “composer.” Prediction of question type, answer type, and/or LAT may use a rules-based heuristic approach, a classifier trained offline (e.g., prior to receiving the natural language question online) using machine learning, or a combination of these two approaches. In the example of
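By way of illustration only, the short Python sketch below shows one possible rules-based heuristic for predicting a question type from surface patterns; the patterns and the default type are assumptions made for this example rather than rules taken from the embodiments described herein.

    import re

    # Hypothetical surface patterns for guessing a question type; a deployed
    # system would use richer rules and/or a trained classifier.
    QUESTION_TYPE_RULES = [
        (r"\bmean\b|\bdefine\b|\bdefinition\b", "definition"),
        (r"\bhow many\b|\bhow much\b|\blowest\b|\bsum of\b", "mathematics"),
        (r"\bspell\b|\banagram\b", "puzzle"),
    ]

    def predict_question_type(question):
        q = question.lower()
        for pattern, question_type in QUESTION_TYPE_RULES:
            if re.search(pattern, q):
                return question_type
        return "factoid"  # default when no rule fires

    print(predict_question_type("What is the capital of Bulgaria?"))                  # factoid
    print(predict_question_type("What words can I spell with the letters BYONGEO?"))  # puzzle

In practice, such hand-written rules would typically be combined with or replaced by the machine learning-trained classifier noted above.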
Question Understanding may also include the extraction of query units from the natural language question. Query units may include one or more of the following: words, base noun-phrases, sentences, named entities, quotations, paraphrases (e.g., reformulations based on synonyms, hypernyms, and the like), and facts. Query units may be extracted using a grammar-based analysis of the natural language question, including one or more of the following: chunking, sentence boundary detection, sentence pattern detection, parsing, named entity detection, part-of-speech tagging, and tokenization. In the example shown in
In some embodiments, the second phase is Query Formulation. In this phase, the information gained from the Question Understanding phase may be used to generate one or more search queries for gathering evidence to determine an answer to the natural language question. In some embodiments, the extracted query units as well as the question type, answer type, and/or LAT are applied to one or more query generation templates to generate a set of candidate queries. The candidate queries may be ranked using a ranker trained offline using an unsupervised or supervised machine learning technique such as support vector machine (SVM). In some embodiments, a predefined number N (e.g., 25) of the top ranked queries are sent to be executed by one or more web search engines such as Microsoft® Bing®. In the example shown in
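As a non-authoritative sketch of how query generation templates might be filled from extracted query units, consider the following Python fragment; the template strings, field names, and example units are hypothetical and are shown only to illustrate the idea.

    # Hypothetical query generation templates; field names are assumptions.
    QUERY_TEMPLATES = [
        "{focus} {lat}",                  # e.g., "Bulgaria capital"
        "{lat} of {focus}",               # e.g., "capital of Bulgaria"
        '"{quotation}"',                  # exact-match query for quoted phrases
        "{named_entities} {keywords}",    # bag-of-units fallback
    ]

    def generate_candidate_queries(query_units):
        """Fill each template with the available query units, skipping templates
        whose fields were not extracted for this question."""
        candidates = []
        for template in QUERY_TEMPLATES:
            try:
                candidates.append(template.format(**query_units))
            except KeyError:
                continue  # this template needs a unit that was not extracted
        return candidates

    units = {
        "focus": "Bulgaria",
        "lat": "capital",
        "named_entities": "Bulgaria",
        "keywords": "capital city",
    }
    print(generate_candidate_queries(units))

The resulting candidate queries would then be ranked as described above before being sent to the search engine(s).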
In some embodiments, the third phase is Evidence Gathering, in which the top N ranked search queries are executed by search engine(s) and the search results are analyzed. In some embodiments, the top N results of each search query (e.g., as ranked by the search engine that executed the search query) are merged with one another to create a merged list of search results. In some embodiments, search results may include an address for a result web page, such as a Uniform Resource Locator (URL), Uniform Resource Identifier (URI), Internet Protocol (IP) address, or other identifier, and/or a snippet of content from the result web page. The merged search results may be filtered to remove duplicate results and/or noise results.
In a fourth phase, Answer Extraction/Ranking, candidate answers may be extracted from the search results. In some embodiments, candidate answer extraction includes dictionary-based entity recognition of those named entities in the search result pages that have a type that matches the answer type and/or LAT determined in the Question Understanding phase. In some embodiments, the extracted named entities are normalized to expand contractions, correct spelling errors in the search results, expand proper names (e.g., Bill to William), and so forth. In the example of
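As one possible illustration of this extraction and normalization step (not an implementation taken from the embodiments), the sketch below filters already-detected named entities by answer type and expands proper names with a small alias dictionary; the dictionary and entity tuples are invented for the example.

    # Hypothetical dictionary-based candidate answer extraction.
    ALIASES = {"bill": "William", "bob": "Robert"}   # e.g., expand proper names

    def normalize(entity_text):
        words = [ALIASES.get(word.lower(), word) for word in entity_text.split()]
        return " ".join(words)

    def extract_candidates(entities, answer_type):
        """Keep entities whose detected type matches the predicted answer type."""
        candidates = set()
        for text, entity_type in entities:
            if entity_type == answer_type:
                candidates.add(normalize(text))
        return sorted(candidates)

    entities = [("Bill Clinton", "person"), ("Arkansas", "location")]
    print(extract_candidates(entities, answer_type="person"))   # ['William Clinton']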
The candidate answers may then be ranked by applying a set of features determined for each candidate answer to a ranker trained offline using a machine learning technique (e.g., SVM). In the example of
Environment 200 further includes one or more client computing devices such as client device(s) 204. In some embodiments, client device(s) 204 are associated with one or more end users who may provide natural language questions to a web search engine or other application. Client device(s) 204 may include any type of computing device that a user may employ to send and receive information over networks 202. For example, client device(s) 204 may include, but are not limited to, desktop computers, laptop computers, tablet computers, e-Book readers, wearable computers, media players, automotive computers, mobile computing devices, smart phones, personal digital assistants (PDAs), game consoles, mobile gaming devices, set-top boxes, and the like. Client device(s) 204 may include one or more applications, programs, or software components (e.g., a web browser) to enable a user to browse to an online search engine or other networked application and enter a natural language question to be answered through the embodiments described herein.
As further shown
In some embodiments, natural language question processing server device(s) 206 provide services for receiving, analyzing, and/or answering natural language questions received from users of client device(s) 204. These services are described further herein with regard to
In some embodiments, search engine server device(s) 208 provide services (e.g., a search engine software application and user interface) for performing online web searches. As such, these servers may receive web search queries and provide results in the form of an address or identifier (e.g., URL, URI, IP address, and the like) of a web page that satisfies the search query, and/or at least a portion of content (e.g., a snippet) from the resulting web page. Search engine server device(s) 208 may also rank search results in order of relevancy or predicted user interest. In some embodiments, natural language question processing server device(s) 206 may employ one or more search engines hosted by search engine server device(s) 208 to gather evidence for answering a natural language question, as described further herein.
In some embodiments, machine learning server device(s) 210 provide services for training classifier(s), ranker(s), and/or other components to perform the classifying and/or ranking described herein. These services may include unsupervised and/or supervised machine learning techniques such as SVM.
As shown in
Computing system 300 further includes a system memory 304, which may include volatile memory such as random access memory (RAM) 306, static random access memory (SRAM), dynamic random access memory (DRAM), and the like. RAM 306 includes one or more executing operating systems (OS) 308, and one or more executing processes including components, programs, or applications that are loadable and executable by processing unit 302. Such processes may include a natural language question process component 310 to perform actions for receiving, analyzing, gathering evidence pertaining to, and/or answering a natural language question provided by a user. These functions are described further herein with regard to
System memory 304 may further include non-volatile memory such as read only memory (ROM) 316, flash memory, and the like. As shown, ROM 316 may include a Basic Input/Output System (BIOS) 318 used to boot computing system 300. Though not shown, system memory 304 may further store program or component data that is generated and/or utilized by OS 308 or any of the components, programs, or applications executing in system memory 304. System memory 304 may also include cache memory.
As shown in
In general, computer-readable media includes computer-readable storage media and communications media.
Computer-readable storage media is tangible media that includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, and other data. Computer storage media includes, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), SRAM, DRAM, flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.
In contrast, communication media is non-tangible and may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transmission mechanism. As defined herein, computer-readable storage media does not include communication media.
Computing system 300 may also include input device(s) 326, including but not limited to a keyboard, a mouse, a pen, a game controller, a voice input device for speech recognition, a touch screen, a touch input device, a gesture input device, a motion- or object-based recognition input device, a biometric information input device, and the like. Computing system 300 may further include output device(s) 328 including but not limited to a display, a printer, audio speakers, a haptic output, and the like. Computing system 300 may further include communications connection(s) 330 that allow computing system 300 to communicate with other computing device(s) 332 including client devices, server devices, databases, and/or other networked devices available over one or more communication networks.
At 404, the natural language question and/or category is analyzed to predict or determine a question type and an answer type associated with the natural language question. In some embodiments, a LAT is also predicted for the question. One or more query units may also be extracted from the natural language question. These tasks are part of the Question Understanding phase, and are described in further detail with regard to
At 406, one or more search queries are formulated based on the analysis of the natural language question at 404. In some embodiments, this formulation includes applying the query units, question type, answer type, and/or LAT to one or more query generation templates. These tasks are part of the Query Formulation phase and are described further with regard to
At 408, evidence is gathered through execution of the one or more search queries by at least one search engine. This Evidence Gathering phase is described further with regard to
At 410, the search results resulting from execution of the one or more search queries are analyzed to extract or otherwise determine and rank one or more candidate answers from the search results. This Answer Extraction and Ranking phase is described further with regard to
At 412, one or more candidate answers are provided to the user. In some embodiments, a certain predetermined number of the top ranked candidate answers are provided to the user. In some embodiments, a confidence level may also be provided alongside each candidate answer to provide a measure of confidence that the system has that the candidate answer may be accurate. In some embodiments, a highest-ranked candidate answer is provided to the user as the answer to the natural language question, based on the confidence level for that highest-ranked candidate answer being higher than a predetermined threshold confidence level. Further, in some embodiments if there is no candidate answer with a confidence level higher than the threshold confidence level, the user may be provided with a message or other indication that no candidate answer achieved the minimum confidence level.
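A minimal sketch of that thresholding decision is shown below, assuming (for illustration only) that the candidate answers arrive as (answer, confidence) pairs already sorted with the highest-ranked candidate first.

    # Hypothetical sketch of providing an answer only when confidence is sufficient.
    def select_answer(ranked_candidates, threshold=0.6):
        """ranked_candidates: list of (answer, confidence), highest-ranked first."""
        if not ranked_candidates:
            return "No candidate answer was found."
        best_answer, confidence = ranked_candidates[0]
        if confidence >= threshold:
            return best_answer
        return "No candidate answer reached the minimum confidence level."

    print(select_answer([("Sofia", 0.87), ("Plovdiv", 0.11)]))   # Sofia
    print(select_answer([("Sofia", 0.42)]))                      # below-threshold message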
Mathematically, process 400 may be described as follows in Formula 1:

h = argmax_{h ∈ H} P(h | Q)    (Formula 1)

where Q denotes the input natural language question, H denotes the hypothesis space of candidate answers, and h denotes a candidate answer. Embodiments aim to find the hypothesis (e.g., answer) h which maximizes the probability P(h|Q).
P(h|Q) may be further expanded to P(h|Q, S, K), where S denotes the search engine and K denotes the knowledge base (in embodiments that use an adjunct knowledge base). The formula may be further decomposed into the following parts:
At 506, a lexical answer type (LAT) 508 may be determined based on an analysis of the natural language question. In some embodiments, the LAT 508 is a word or phrase which identifies a category for the answer to the natural language question. In some cases, the LAT may be a word or phrase found in the natural language question itself. In some embodiments, a heuristic, rules-based approach is used to determine the LAT. For example, a binary linear decision tree model may be employed, incorporating various rules, and the LAT may be determined by traversing the decision tree for each noun-phrase (NP) in the natural language question. Rules may include one or more of the following:
As an example application of the above rules, the following natural language question may be received: “He wrote his ‘Letter from Birmingham Jail’ from the city jail in Birmingham, Ala. in 1963.” This question may have been received with a category of “Prisoners' Sentences.” Determination of the LAT may follow the rules in the decision tree above:
In some embodiments, the LAT is predicted through a machine learning process by applying a classifier trained offline to one or more features of the natural language question. In embodiments, this machine learning-based approach for determining the LAT may be used instead of or in combination with the heuristic, rules-based approach described above.
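Because the concrete decision-tree rules are enumerated elsewhere, the fragment below only illustrates the general shape of such a traversal over noun-phrases; every rule shown is a hypothetical stand-in and not a rule of the disclosed decision tree.

    # Hypothetical rule-based LAT selection over the question's noun-phrases.
    PRONOUNS = {"he", "she", "it", "they", "this", "these"}

    def determine_lat(noun_phrases, category=None):
        for noun_phrase in noun_phrases:
            head = noun_phrase.split()[-1].strip("'\".,?").lower()
            if head in PRONOUNS:
                continue              # hypothetical rule: a bare pronoun is not a LAT
            if noun_phrase.istitle():
                continue              # hypothetical rule: skip proper-name phrases
            return head               # hypothetical rule: head of the first remaining NP
        # hypothetical fallback: derive the LAT from the question category, if provided
        return category.split()[-1].strip("'\".,?").lower() if category else None

    print(determine_lat(["the capital", "Bulgaria"]))   # capital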
At 510, an answer type 512 is determined based on an analysis of the natural language question. Answer type 512 may be a person, a location, a time/date, a quantity, an event, an organism (e.g., animal, plant, etc.), an object, a concept, or any other answer type. In some embodiments, a machine learning-trained classifier is used to predict the answer type based on a plurality of features of the natural language question. In some embodiments, a log-linear classification model may be employed. This model may be expressed mathematically as in Formula 2:
t = argmax_{t_i} P(t_i | x_1, . . . , x_K)    (Formula 2)

where t denotes the determined answer type, x_j denotes the features for j ∈ [1, K], and t_i denotes the possible answer types for i ∈ [1, N]. Features may include, but are not limited to, the following:
In some embodiments, prediction of the answer type may be performed based on application of a plurality of rules to the natural language question, either separate from or in combination with the machine learning-based technique described above.
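As a concrete but non-authoritative stand-in for such a log-linear classifier, the sketch below trains scikit-learn's multinomial logistic regression on toy bag-of-words features; the training examples and feature encoding are fabricated for illustration and are not the features of Formula 2.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy training data: (question, answer type). A real system would train
    # offline on many labeled questions and richer features.
    questions = [
        "What is the capital of Bulgaria?",
        "Who composed the Moonlight Sonata?",
        "When was the Magna Carta signed?",
        "How many moons does Mars have?",
    ]
    answer_types = ["location", "person", "time/date", "quantity"]

    # Multinomial logistic regression is a log-linear classification model.
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(questions, answer_types)

    print(model.predict(["Who painted the Mona Lisa?"]))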
At 514, one or more query units 516 are extracted from the natural language question, based on grammar-based and/or syntax based analysis of the question. Query units may include one or more of the following: words, base noun-phrases, sentences, named entities, quotations, paraphrases (e.g., reformulations based on synonyms, hypernyms, and the like), dependency relationships, time and number units, and facts. Further, some embodiments may employ at least one knowledge base as an adjunct to the search query-based methods described herein. In such cases, the extracted query units may also include attributes of the natural language question found in the at least one knowledge base. Extraction of query units may include one or more of the following: sentence boundary detection 518, sentence pattern detection 520, parsing 522, named entity detection 524, part-of-speech tagging 526, tokenization 528, and chunking 530.
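For illustration, the sketch below uses the open-source spaCy library (an assumption; no specific toolkit is required by the embodiments) to extract several of these query unit types, and it presumes the en_core_web_sm model has been installed.

    import spacy

    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def extract_query_units(question):
        doc = nlp(question)
        return {
            "tokens": [t.text for t in doc],                           # tokenization
            "pos_tags": [(t.text, t.pos_) for t in doc],               # part-of-speech tagging
            "noun_phrases": [chunk.text for chunk in doc.noun_chunks], # chunking
            "named_entities": [(e.text, e.label_) for e in doc.ents],  # named entity detection
            "sentences": [s.text for s in doc.sents],                  # sentence boundary detection
        }

    units = extract_query_units("When was the Magna Carta signed?")
    print(units["noun_phrases"], units["named_entities"])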
At 606, the one or more candidate queries are ranked to determine a predetermined number N (e.g., top 20) of the highest ranked candidate queries. In some embodiments, ranking of candidate queries employs a ranker that is trained offline using an unsupervised or supervised machine learning technique (e.g., SVM), and which ranks the candidate queries based on one or more features of the candidate queries. At 608, the top N ranked candidate queries are identified as the one or more search queries 610 to be executed by one or more search engines during the Evidence Gathering phase.
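A minimal sketch of this scoring-and-selection step is shown below, assuming a previously trained linear ranker represented by a weight vector; the three query features and the weights are invented for illustration.

    import numpy as np

    # Hypothetical query features; a trained ranker would use many more.
    def query_features(query, question):
        return np.array([
            len(query.split()),                                              # query length
            sum(word in question.lower() for word in query.lower().split()), # overlap with question
            query.count('"'),                                                # exact-match phrase markers
        ], dtype=float)

    def top_n_queries(candidate_queries, question, weights, n=3):
        scored = [(float(weights @ query_features(q, question)), q) for q in candidate_queries]
        return [q for _, q in sorted(scored, reverse=True)[:n]]

    weights = np.array([-0.1, 1.0, 0.5])    # stand-in for an offline-trained ranker
    queries = ["capital of Bulgaria", "Bulgaria capital city", '"capital of Bulgaria"', "Bulgaria"]
    print(top_n_queries(queries, "What is the capital of Bulgaria?", weights, n=2))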
In some embodiments, the search results may have been ranked by the search engine according to relevance, and the top N (e.g., 20) search results may be selected from each set of search results for further processing. At 706, the top N search results from each set of search results are merged to form a merged set of search results for further processing. At 708, the merged search results are filtered to remove duplicate results and/or noise results. In some embodiments, noise results may be determined based on a predetermined web site quality measurement (e.g., known low-quality sites may be filtered). In some embodiments, filtering may be further based on content readability or some other quality measurement of the content of the result web sites.
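One possible shape of this merge-and-filter step is sketched below; the result format, the low-quality site list, and the per-query cutoff are assumptions made for the example.

    # Hypothetical merge of per-query result lists with duplicate and noise filtering.
    LOW_QUALITY_SITES = {"spam.example.com"}   # stand-in for a site quality measurement

    def merge_and_filter(result_sets, top_n=20):
        merged, seen_urls = [], set()
        for results in result_sets:              # one ranked list per executed query
            for result in results[:top_n]:       # keep each query's top N results
                url = result["url"]
                if url in seen_urls:
                    continue                     # drop duplicate results
                if any(site in url for site in LOW_QUALITY_SITES):
                    continue                     # drop known low-quality (noise) results
                seen_urls.add(url)
                merged.append(result)
        return merged

    sets = [
        [{"url": "https://en.wikipedia.org/wiki/Sofia", "snippet": "Sofia is the capital..."}],
        [{"url": "https://spam.example.com/page", "snippet": "..."},
         {"url": "https://en.wikipedia.org/wiki/Sofia", "snippet": "Sofia is the capital..."}],
    ]
    print([r["url"] for r in merge_and_filter(sets)])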
At 710, the search results are ranked using a ranker. In some embodiments, the ranker is trained offline using an unsupervised or supervised machine learning method (e.g., SVM), using a set of features. For example, for a natural language question Q, given the n candidate search result pages d_1 . . . d_n, the ranking may include a binary classification based on search result pairs <d_i, d_j> where 1 ≤ i, j ≤ n and i ≠ j. Linear ranking functions f_w may be defined based on features related to d and/or features describing a correspondence between Q and d. The weight vector w may then be trained using a machine learning technique such as SVM. In this example, the search results list may then be ranked according to a score which is the dot product of the feature function values and their corresponding weights for each result page.
In some embodiments, the features used for ranking may include, but are not limited to, one or more of the following:
At 712, the top N ranked search results are selected and identified as search results 714 for candidate answer extraction during the Answer Extraction and Ranking phase. In some embodiments, the top number of ranked search results is tunable (e.g., N may be tuned) based on a performance criterion.
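The dot-product scoring and tunable cutoff described above might look like the following sketch, in which the feature values and trained weights are placeholders rather than the features of the embodiments.

    import numpy as np

    # Hypothetical: each search result has a precomputed feature vector, and the
    # weight vector w comes from an offline-trained linear ranker (e.g., SVM).
    def rank_results(results, feature_vectors, w, top_n=10):
        scores = feature_vectors @ w                 # dot product per result page
        order = np.argsort(-scores)                  # highest score first
        return [results[i] for i in order[:top_n]]   # tunable cutoff N

    w = np.array([0.8, 0.3, -0.5])                   # stand-in trained weights
    features = np.array([[1.0, 0.2, 0.0],
                         [0.4, 0.9, 0.3],
                         [0.1, 0.1, 0.8]])
    print(rank_results(["doc1", "doc2", "doc3"], features, w, top_n=2))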
At 806, one or more features are extracted for the candidate answers, and at 808, the candidate answers are ranked based on the features. In some embodiments, the ranking is performed using a ranker trained offline through a machine learning process such as SVM. In some embodiments, for a natural language question Q and given the n candidate answers h_1 . . . h_n, the ranking may include a binary classification of candidate pairs <h_i, h_j> where 1 ≤ i, j ≤ n and i ≠ j. Linear ranking functions f_w may be defined based on features related to the candidate answer h (e.g., the frequency of appearance of the candidate answer in search result pages) and/or features describing a correspondence between Q and h (e.g., LAT match). The weight vector (e.g., ranker) w may be trained using a machine learning method such as SVM, and the answer candidate list may then be ranked according to each candidate's score, which is the dot product of feature function values and the corresponding weights.
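One common way to realize such pairwise training, sketched here with scikit-learn's LinearSVC as a stand-in (no particular library is required by the embodiments, and the training data below is fabricated): candidate pairs are converted into binary examples on feature differences, and the learned coefficients serve as the weight vector w.

    import numpy as np
    from sklearn.svm import LinearSVC

    def make_pairwise_examples(features, relevance):
        """Turn candidate pairs <h_i, h_j> into binary examples on feature differences."""
        X, y = [], []
        n = len(features)
        for i in range(n):
            for j in range(n):
                if i != j and relevance[i] != relevance[j]:
                    X.append(features[i] - features[j])
                    y.append(1 if relevance[i] > relevance[j] else 0)
        return np.array(X), np.array(y)

    # Fabricated per-candidate features, e.g., [frequency in result pages, LAT match].
    features = np.array([[5.0, 1.0], [2.0, 1.0], [7.0, 0.0]])
    relevance = [1, 0, 0]        # 1 marks the correct answer for a training question

    X, y = make_pairwise_examples(features, relevance)
    w = LinearSVC().fit(X, y).coef_[0]      # learned weight vector

    scores = features @ w                   # rank candidates by dot-product score
    print(np.argsort(-scores))              # candidate indices, best first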
The features used may include features that are common to all answer types, and/or features that are specific to particular answer types. In some embodiments, the common features include but are not limited to the following:
In some embodiments, the answer type-specific features include but are not limited to those in Table 1.
At 810, a confidence level is determined for one or more of the candidate answers. In some embodiments, the confidence level is determined for the highest-ranked candidate answer. In some embodiments, the confidence level is determined for the top N ranked candidate answers or for all candidate answers. After the confidence level is determined, the answer may be provided to the user as described above with regard to
Although the techniques have been described in language specific to structural features and/or methodological acts, it is to be understood that the appended claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example implementations of such techniques.