The present invention deals with information management. More specifically, the present invention deals with providing a question answering system that answers a predetermined number of questions having a predefined form based on a user input query.
Management of electronic information presents many challenges. One such challenge is providing information to users of an electronic system in response to the users' queries. Conventional systems for performing this management task have typically fallen into two categories: question answering and information retrieval.
Conventional question answering systems have, as a goal, answering any type of free-form question entered by a user. While such a system may be very useful, it is also very challenging to implement.
For instance, if a user can enter substantially any query, in any form, the question answering system must typically employ one of a relatively few number of known methods for discerning the meaning of the user's query, before it attempts to answer the query. One technique involves natural language processing. Natural language processing typically involves receiving a natural language input and determining the meaning of the input such that it can be used by a computer system. In the context of question answering, the natural language processing system discerns the meaning of a natural language query input by the user and then attempts to identify information responsive to that query.
Another common technique involves implementing handwritten rules. In such a system, an author attempts to think of every possible way that a user might ask for certain information. The author then writes a rule that maps from those possible query forms to responsive information.
Both of these prior techniques for implementing question answering systems can be relatively expensive to implement, and can be somewhat error prone. In large part, the expense and errors arise from the fact that these systems attempt to answer substantially any question which the user can input.
Prior information retrieval systems attempt to use key words provided by a user to find documents relevant to those key words. This approach has its own disadvantages; for instance, such systems cannot easily accommodate users' differing search requests. An information retrieval system attempts to balance recall and precision in returning results. In other words, an information retrieval system conventionally attempts to maximize the amount of relevant information which is returned (maximizing recall) while minimizing the amount of irrelevant information that is returned (i.e., maximizing precision).
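By way of example, and not limitation, the balance referred to above can be stated using the standard definitions of precision and recall, where R denotes the set of returned results and D_rel the set of truly relevant documents (this notation is introduced here for illustration only):

```latex
\mathrm{precision} = \frac{|R \cap D_{\mathrm{rel}}|}{|R|},
\qquad
\mathrm{recall} = \frac{|R \cap D_{\mathrm{rel}}|}{|D_{\mathrm{rel}}|}
```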
Queries input by users into these types of systems primarily break down into three categories: informational, transactional, and navigational. An informational query, for instance, is one which asks questions such as “Who is X?”, “What is X?” or “Who knows about X?”. These types of queries simply seek information about a subject matter or person. Transactional queries typically involve the user asking a question about how to accomplish some sort of transaction, such as “Where do I submit an expense report?” or “Where can I shop for books?”. The results sought by the user are often a destination or a description of a procedure for accomplishing the desired transaction. Navigational queries involve the user requesting a destination link such as “Where is the homepage of X?” or “What is the URL for X?”. With navigational queries, the user is typically seeking, as a result, a web page address or other similar link.
The present invention is a system for answering questions. The present invention uses a data mining module to mine data, such as enterprise data, and to configure the data to answer a predetermined number of questions, each having a predefined form. The present invention also provides a user interface component for receiving user queries and responding to those queries.
The present invention deals with a question answering system. More specifically, the present invention deals with a data mining module that mines data and a user interface that utilizes the mined data in order to perform question answering. However, before describing the present invention in greater detail, one illustrative embodiment of an environment in which the present invention can be used will be discussed.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention is designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet.
The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
The present description will proceed with respect to a question answering system that answers the questions “What is X?”, “Who is X?”, “Who knows about X?”, and “Where is the homepage of X?” where “X” is entered by the user. However, it will be appreciated that fewer, different, or additional questions can be answered as well while maintaining the inventive concept of the present invention. For instance, the present invention can be used to answer questions such as “I need to do X”, “How to do X”, etc. However, the present invention limits the number of questions allowed to a predetermined, relatively small number, such as approximately ten or fewer, and constrains each question to one of a number of predefined forms. Again, the present discussion proceeds with respect to the four questions having the predefined forms mentioned above, but this is by way of example only.
In operation, briefly, text mining component 202 receives access to source documents 206 through network 203. Text mining component 202 illustratively includes metadata extraction component 210, relationship extraction component 212 and domain-specific knowledge extraction component 214. As is described in greater detail below, metadata extraction component 210 receives text from source documents 206 and extracts relevant metadata to be used in answering questions. Relationship extraction component 212 also receives the text from source documents 206 and the output from metadata extraction component 210, and extracts relationship information which is used in answering questions. The information from components 210 and 212 is provided to knowledge database 204 and is stored in a metadata and relationship knowledge store 216 for answering questions such as “Who knows about X?” and “Who is X?”, where “X” is input by the user.
As is also described in greater detail below, domain-specific knowledge extraction component 214 extracts domain-specific data from source documents 206 and provides it to domain-specific knowledge store 218 in knowledge database 204. The domain-specific information in knowledge store 218 is used, for example, to answer questions such as “Where is the homepage of X?” and “What is X?”.
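By way of example, and not limitation, the following is a minimal sketch of how the two knowledge stores might be represented in memory; the class and field names are assumptions made for illustration and do not appear in the present description.

```python
# Minimal sketch of the two knowledge stores described above. All class and
# field names are illustrative assumptions, not names used in this description.
from dataclasses import dataclass, field


@dataclass
class MetadataRelationshipStore:      # corresponds to knowledge store 216
    # <concept, person> pairs used to answer "Who knows about X?" and "Who is X?"
    concept_person_pairs: list = field(default_factory=list)


@dataclass
class DomainSpecificStore:            # corresponds to knowledge store 218
    definitions: dict = field(default_factory=dict)  # concept -> ranked definitions
    acronyms: dict = field(default_factory=dict)     # acronym -> expansion
    homepages: dict = field(default_factory=dict)    # entity -> homepage URL


@dataclass
class KnowledgeDatabase:              # corresponds to knowledge database 204
    metadata_relationship: MetadataRelationshipStore = field(
        default_factory=MetadataRelationshipStore)
    domain_specific: DomainSpecificStore = field(
        default_factory=DomainSpecificStore)
```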
Question answering UI component 208 receives a user input query 220 and accesses knowledge database 204 to provide the user with an answer to the question. In one illustrative embodiment, question answering UI component 208 allows the user to select one of a predetermined number of predefined queries, or determines which of those predetermined, predefined queries the user is attempting to invoke. By limiting the number of queries to a predetermined number, and by limiting the specific form of the queries allowed to be one of a number of predefined forms, the present invention can answer nearly all queries requested by users, but avoids a number of the significant disadvantages associated with prior art question answering and information retrieval systems.
In another optional embodiment, UI component 208 is also coupled to a conventional IR system 221. System 221 illustratively employs a conventional IR search engine and accesses data in a conventional way (such as through a wide area network, e.g., the internet, or a local area network) in response to the input query. Thus, UI component 208 can integrate or otherwise combine question answering results from database 204 with conventional search results from system 221 in response to user input query 220.
In the embodiment described herein, definition extraction model 230 is illustratively a statistical binary classifier which extracts from the text in source documents 206 all paragraphs which can serve as a definition of a concept. The classifier is trained by annotating training data and feeding that training data into a statistical classifier training module, which can implement one of a wide variety of known training techniques. One such training technique is well-known and trains the statistical classifier as a support vector machine (SVM). In accordance with that technique, features are obtained which are used to classify the text under consideration to determine whether it is a definitional paragraph. A wide variety of different features can be used by the classifier and one illustrative definition extraction feature list is illustrated in Table 1 below.
The features illustrated in Table 1 identify such things as whether the first phrase in a paragraph is a noun phrase, and whether that noun phrase occurs frequently within the paragraph. If so, the paragraph is probably a definitional paragraph. The features also identify such things as whether pronouns occur in the main phrase of the paragraph. If so, it is probably not a definitional paragraph. Other features are illustrated as well, and they are each associated with a score.
Table 1 shows the category of each of the features listed, along with the number of bits associated with each feature, and the weight corresponding to each feature. The features are broken into categories of features that correspond to the main phrase of the text, those that correspond to the entire paragraph of the text, and those that correspond to the group of words which comprise the text. Of course, additional or different features can be used as well, they can be categorized differently, and they can be given different weights. Those illustrated in Table 1 are provided by way of example only. It should also be noted that where the weight is listed as “rule”, that indicates that the weight is determined by a subsidiary rule which is applied to the particular text fragment.
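By way of example, and not limitation, the following sketch shows how such a binary classifier could be trained as a linear support vector machine over simple paragraph features. The feature functions below are simplified stand-ins for the Table 1 features, and the use of the scikit-learn library is an assumption made for illustration only.

```python
# Hedged sketch of a definitional-paragraph classifier trained as a linear SVM.
from sklearn.svm import LinearSVC


def paragraph_features(paragraph: str, concept: str) -> list:
    # Simplified stand-ins for the Table 1 features (assumed for illustration).
    words = paragraph.split()
    first_word = words[0].lower() if words else ""
    return [
        1.0 if paragraph.lower().startswith(concept.lower()) else 0.0,     # main phrase is the concept
        float(paragraph.lower().count(concept.lower())),                   # concept frequency in paragraph
        1.0 if first_word in {"he", "she", "it", "they"} else 0.0,         # pronoun begins the main phrase
        1.0 if " is a " in paragraph or " is the " in paragraph else 0.0,  # copular definition cue
        float(len(words)),                                                 # paragraph length
    ]


def train_definition_classifier(labeled_paragraphs):
    """labeled_paragraphs: iterable of (paragraph, concept, is_definition) tuples."""
    X = [paragraph_features(p, c) for p, c, _ in labeled_paragraphs]
    y = [1 if is_definition else 0 for _, _, is_definition in labeled_paragraphs]
    classifier = LinearSVC()
    classifier.fit(X, y)
    return classifier
```

Under this sketch, the classifier's signed decision score for each extracted paragraph could also serve as the relevance ranking discussed next.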
In answering questions about definitions, definition extraction model 230 also illustratively ranks the definitions of concepts based on how closely the definitions correspond to the concepts. Therefore, when the user asks the question “What is X?”, the definitional paragraphs extracted for “X” will be ranked in order of their relevance. Definition extraction model 230 thus outputs the results of processing source documents 206 as <concept, definition> pairs where the “concept” identifies the concept which is defined, and the “definition” provides the definition of that concept. These pairs are stored in domain-specific knowledge store 218, where multiple definitions for a single concept are illustratively ranked by relevance.
Acronym extraction model 232 illustratively includes patterns 236 and filtering rules 238. Acronym extraction model 232 illustratively receives source documents 206 and identifies acronyms, and the expansions of those acronyms, and generates <acronym, expansion> pairs which are also stored in domain-specific knowledge store 218. Identifying the acronyms and expansions and generating the pairs is illustratively viewed as a pattern matching problem. Therefore the text in source documents 206 is matched to patterns 236 and the matches are filtered using filtering rules 238 in order to obtain the acronym, expansion pairs. This is illustrated in greater detail in
Once candidate acronym, expansion pairs have been identified using the patterns shown in Table 2, the filtering rules are applied to each of the candidate acronym, expansion pairs. This is indicated by blocks 244, 246 and 248 in
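By way of example, and not limitation, the following sketch illustrates the pattern matching and filtering approach. The single regular expression and the one filtering rule shown stand in for the patterns of Table 2 and the filtering rules 238, and are assumptions made for illustration only.

```python
# Hedged sketch of <acronym, expansion> extraction as pattern matching plus filtering.
import re

# Matches text such as "Support Vector Machine (SVM)".
PATTERN = re.compile(r"([A-Z][\w-]*(?:\s+[\w-]+){0,6})\s*\(([A-Z]{2,10})\)")


def acronym_matches_expansion(acronym: str, expansion: str) -> bool:
    """Filtering rule: the acronym should match the initial letters of the expansion."""
    initials = "".join(word[0].upper() for word in expansion.split() if word)
    return initials.endswith(acronym.upper())


def extract_acronym_pairs(text: str):
    pairs = []
    for expansion, acronym in PATTERN.findall(text):
        # Keep only the trailing len(acronym) words of the candidate expansion.
        candidate = " ".join(expansion.split()[-len(acronym):])
        if acronym_matches_expansion(acronym, candidate):
            pairs.append((acronym, candidate))
    return pairs


# extract_acronym_pairs("trains a Support Vector Machine (SVM) classifier")
# returns [("SVM", "Support Vector Machine")] under these assumed patterns.
```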
Homepage extraction model 234 can illustratively be a pattern matching model or a statistical model, as desired. Of course, other ways for identifying homepages in source documents 206 can be employed as well. For instance, if the tool used to create the web page has an attribute or identifier which identifies a particular page as the “homepage”, model 234 can simply review that attribute of the page to determine whether it is a homepage.
In the embodiment in which homepage extraction model 234 is a binary classifier, the classifier is trained from labeled training data, using any suitable statistical classifier training technique. The classifier is trained to determine whether a web page is a homepage associated with a group or person, for instance.
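By way of example, and not limitation, the simpler attribute-based check mentioned above might look as follows; the particular cues used here (a shallow URL path and a page title containing "home") are assumptions chosen for illustration, since the present description leaves the exact attribute to the authoring tool.

```python
# Hedged heuristic sketch of an attribute-based homepage check (not the trained classifier).
import re
from urllib.parse import urlparse


def looks_like_homepage(url: str, html: str) -> bool:
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip().lower() if match else ""
    shallow_path = urlparse(url).path.strip("/").count("/") == 0  # e.g. "/" or "/team/"
    return shallow_path and "home" in title
```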
In the embodiment shown in
It should also be noted that the metadata to be extracted may be contained in actual metadata fields associated with source documents 206. However, it has been found that such metadata is often inaccurate. In fact, it has been found that, in some instances, the metadata associated with source documents 206 is inaccurate as much as 80 percent of the time. Therefore, the present invention uses component 210 to extract metadata, such as author, title and key terms, from the content of the source documents 206, as opposed to any metadata fields associated with those documents.
In the embodiment discussed herein, the extraction of author and title information from source documents 206 is performed by author extraction model 260 and title extraction model 262. Models 260 and 262 are illustratively statistical classifiers that are trained to determine whether several consecutive lines comprise an author or title. Also, in one exemplary embodiment, for HTML documents, only titles are extracted, although other information could be extracted as well.
One exemplary feature list used by author extraction model 260 is shown in Table 3.
Again, the features shown in Table 3 are identified by category, by the specific feature used, by the bits associated with each feature (i.e., the number of bits used to identify whether the feature is present or absent in the text being processed) and the weight associated with that feature. It can be seen from Table 3 that the weights may vary depending on the type of document being processed. For instance, if the document is a word processor document, the weights may have one value while if the document is a presentation (such as slides), the weights may have a different value.
Title extraction model 262 may illustratively be comprised of two models which are used to identify the beginning and ending of a title in a text fragment. Table 4 is a feature list for title extraction model 262 when it is implemented as a statistical classifier. In one illustrative embodiment, title extraction model 262 receives text fragments from the first page of word processing documents and from the first slide of slide presentations.
Table 4 illustrates the category, feature, number of bits corresponding to each feature, and the weights associated with each feature. In Table 4, weight one corresponds to the first model that identifies the beginning of a title, and weight two corresponds to the second model that identifies the end of the title. It can also be seen that the weights corresponding to each feature may also vary based on the type of document being processed.
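By way of example, and not limitation, the two models can be applied to the consecutive lines of a document's first page or first slide as sketched below; the begin_model, end_model, and line_features arguments are assumed to be supplied by a training procedure such as the one described above.

```python
# Hedged sketch of delimiting a title with separate begin-of-title and end-of-title models.
def extract_title(lines, begin_model, end_model, line_features):
    """Return the text between the first predicted title start and title end."""
    start = end = None
    for i, line in enumerate(lines):
        features = [line_features(line)]
        if start is None and begin_model.predict(features)[0] == 1:
            start = i
        elif start is not None and end_model.predict(features)[0] == 1:
            end = i
            break
    if start is None:
        return None
    last = end if end is not None else start
    return " ".join(lines[start:last + 1])
```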
Key term extraction model 264 is used to extract key terms from the source documents 206. The key terms are illustratively indicative of the contents of a given document being processed. These terms illustratively identify the concepts being described in the document. Model 264 can use any of a wide variety of different techniques for identifying key terms or content words in a document. Many such techniques are commonly used for indexing documents in information retrieval systems. One such technique is the well-known term frequency * inverse document frequency (tf*idf). Other, simpler techniques examine the position and frequency of a term: if a term tends to appear at the beginning of a document and is used frequently throughout the document, then it is likely a key term.
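By way of example, and not limitation, a plain implementation of the well-known tf*idf scoring is sketched below; it is offered as an illustration of the technique rather than as the implementation of key term extraction model 264.

```python
# Hedged sketch of tf*idf key-term scoring over a small document collection.
import math
from collections import Counter


def key_terms(documents, doc_index, top_n=10):
    """documents: list of token lists; returns the top_n tf*idf terms of one document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))          # number of documents containing each term

    doc = documents[doc_index]
    tf = Counter(doc)
    scores = {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```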
Relationship extraction model 212 receives the outputs from models 260, 262 and 264 and also receives source documents 206. Relationship extraction model 212 generates <concept, person> pairs that identify relationships between people and concepts. These pairs can be used, for instance, to answer questions such as “Who knows about X?”, and “Who is X?” In order to generate these types of pairs, relationship extraction model 212 determines, for instance, whether a “concept” and a “person” appear in the title and author portions of the same document, respectively. If so, then the concept, person pair is created. Model 212 also determines whether a “concept” and “person” appear in the key term and author portions of the same document, respectively. If so, the concept, person pair is created. Similarly, model 212 can determine whether a “concept” and “person” co-occur frequently within a document collection. If so, the pair is created as well. Of course, additional or different tests can be used to determine whether a concept, person pair should be created.
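By way of example, and not limitation, the pair-generation tests described above are sketched below; the per-document dictionary layout (title, authors, key_terms) is an assumption made for illustration.

```python
# Hedged sketch of generating <concept, person> pairs from extracted metadata and key terms.
def extract_concept_person_pairs(documents):
    """documents: iterable of dicts with 'title', 'authors', and 'key_terms' keys."""
    pairs = set()
    for doc in documents:
        for person in doc["authors"]:
            if doc["title"]:
                pairs.add((doc["title"], person))   # concept in title, person in author portion
            for concept in doc["key_terms"]:
                pairs.add((concept, person))        # concept among key terms, person in author portion
    return pairs
```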
Once knowledge stores 216 and 218 are created, question answering UI component 208 can be used to answer queries provided by a user. UI component 208 can be integrated into system 200 in any of a wide variety of ways. A number of these ways will be described below. Suffice it to say, for now, that UI component 208 receives a query which is one of the four queries discussed above (“Who is X?”, “Who knows about X?”, “What is X?”, and “Where is the homepage of X?”).
First, UI component 208 determines which of these questions is being asked by the user. This is indicated by block 270 in
Assuming that component 208 identifies the question as “Who is X?”, then component 208 accesses the documents that are authored by the person “X”. This is indicated by block 272 in
Component 208 also accesses documents that mention the person “X”. This is indicated by block 274 in
Component 208 then accesses relevant key terms. This is indicated by block 276. Relevant key terms are those terms which appear in the documents authored by the author “X” or in the documents that mention “X”.
After accessing all this information, component 208 creates a profile of the person “X”. This is indicated by block 278 in
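By way of example, and not limitation, the profile-building steps just described can be sketched as follows; the document field names are assumptions made for illustration.

```python
# Hedged sketch of assembling a "Who is X?" profile from the steps described above.
def build_person_profile(person, documents):
    """documents: iterable of dicts with 'title', 'authors', 'text', and 'key_terms' keys."""
    authored = [d for d in documents if person in d["authors"]]
    mentioning = [d for d in documents if person in d["text"] and d not in authored]
    relevant_terms = sorted({t for d in authored + mentioning for t in d["key_terms"]})
    return {
        "person": person,
        "authored_documents": [d["title"] for d in authored],
        "documents_mentioning": [d["title"] for d in mentioning],
        "key_terms": relevant_terms,
    }
```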
One illustrative embodiment of an output from UI component 208 in answering the question “Who is John Doe?” is illustrated in
Returning again to
It should be noted that, as discussed previously, UI component 208 can access IR system 221 based on the user input and return IR search results as part of the question answering results. The IR results may be requested by the user by checking an appropriate box, or the IR results can be generated automatically.
It will be appreciated that UI component 208 can be integrated into system 200 in one of a variety of different known ways. One of those ways is illustrated by
Of course, other techniques can be used as well. For instance, if the user types in the entire query, and it is ambiguous, the present invention can return responses to all four different queries, if they are relevant. This is also illustrated in
Similarly, UI component 208 can be integrated into system 200 by training a model to determine the form of the query based on the user's input. For instance, such a model may be a four way classifier which is applied to ambiguous inputs in order to classify the query into one of the four predetermined forms. Similarly, the present system can be implemented to engage in a dialog with the user, to disambiguate the input and specifically identify the form of the query which the user desires. The dialog can request more information from the user or provide suggestions to the user such as check spelling, try using synonyms, etc.
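By way of example, and not limitation, a simple pattern-based stand-in for such a classifier is sketched below; a trained four-way statistical classifier could replace the hand-written patterns, which are assumptions made for illustration.

```python
# Hedged sketch of mapping a user input to one of the four predefined query forms.
import re

QUERY_PATTERNS = [
    (r"^who\s+knows\s+about\s+(.+?)\??$", "Who knows about X?"),
    (r"^who\s+is\s+(.+?)\??$", "Who is X?"),
    (r"^what\s+is\s+(.+?)\??$", "What is X?"),
    (r"^where\s+is\s+the\s+homepage\s+of\s+(.+?)\??$", "Where is the homepage of X?"),
]


def classify_query(user_input: str):
    text = user_input.strip().lower()
    for pattern, form in QUERY_PATTERNS:
        match = re.match(pattern, text)
        if match:
            return form, match.group(1).strip()
    return None, None   # ambiguous: engage in a dialog or answer all four forms
```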
It can thus be seen that the present invention greatly simplifies the question answering process and yet still covers a vast majority of the different types of questions that the user may wish to ask. By limiting the number of different forms of query to a predetermined number having predefined forms, the present invention can quickly and easily mine text and generate and store data structures or records that are suitable for answering that limited number of different query types. In other words, because the present system knows the form in which the queries will be presented, and because the number of allowed forms is relatively small, it can easily arrange the data in the data stores that represent the mined text in a form that is highly suitable for answering those queries.
Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.