The present invention relates to techniques for searching and/or browsing document collections and, in particular, to techniques which use the semantic roles of search terms.
Browsing a large collection of electronic text (e.g., sentence fragments, sentences, paragraphs, and entire documents) to find relevant information can be extremely difficult; so difficult that in most cases search functionality is used instead of browsing. In search, the user is required to enter keywords which are then used to rank the matching items in the collection. Unfortunately, this approach has its own limitations. For example, search requires the user to know the appropriate keywords in advance. In addition, search suffers from the usual problems of natural language, e.g., synonymy, polysemy, etc. Moreover, search tools provide very little feedback to the user when the search is off the mark.
Browsing, on the other hand, does not require the user to choose keywords. Instead, the user is presented with choices of increasing specificity. For example, in shopping web sites, users naturally browse the collection of products by successively choosing more specific product categories, e.g., electronics→camera→zoom. Each time a user makes a choice, he is presented with a list of results and/or additional choices. This feedback can be very useful because it informs the user about the contents of the collection. It is particularly useful when the user does not have a clear idea of what he is looking for (i.e., a fully specified information need), and is learning as he browses. For instance, in the camera example, the user may realize because of the feedback provided during browsing that there are two types of zooms, i.e., analog and digital.
Unfortunately, there are many cases in which there is no natural taxonomy of the documents in a collection. In such cases, browsing interfaces can be extremely frustrating for the user. And where editor-created taxonomies do exist, they are typically either too general or too specific for the needs of a given user, and/or they may organize information differently from what the user expects. Hierarchical clustering of documents is another alternative to taxonomies that has been used with some success, but has its own drawbacks and often leads to frustrating user experiences.
An alternative to browsing categories or clusters is navigation by keyword selection. An example of this is the use of tag clouds in which the most important tags of a collection are shown to the user. When the user selects a tag, this selection is translated into a restriction (or query) and the collection is restricted to documents containing the selected tag. This approach works well on small, homogeneous collections that are heavily tagged by users. However, it has serious drawbacks in that it cannot be applied to collections which have not been tagged, and degrades rapidly for large or heterogeneous collections.
The shortcomings associated with at least some of the foregoing techniques are further exacerbated where the document collection includes sentence fragments, sentences, and/or short paragraphs, in that such documents do not typically have associated metadata, e.g., titles, tags, categories, etc.
According to the present invention, methods and apparatus are provided for searching and/or browsing document collections using semantic roles of terms. According to a first class of embodiments, methods and apparatus are provided for searching a collection of documents. A search interface is provided in which a user specifies a search query including a first keyword. One or more suggested keywords are provided in the search interface. The suggested keywords are included in first ones of the documents in which the first keyword is also included. Each of the suggested keywords has a predetermined semantic relationship with the first keyword in one or more of the first documents. Each of the suggested keywords has one or more associated semantic roles explicitly identified in the search interface. A mechanism is provided in the search interface by which the user refines the search query by selecting one of the suggested keywords in a particular semantic role.
According to a second class of embodiments, methods and apparatus are provided for searching a collection of documents. Each of the documents has an associated semantic representation in which each of a subset of the terms in the associated document has a corresponding semantic role. A first keyword is received from a user device. First ones of the semantic representations including the first keyword are identified. One or more suggested keywords having predetermined semantic relationships with the first keyword are identified with reference to the first semantic representations. The suggested keywords are transmitted to the user device.
A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
Reference will now be made in detail to specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In addition, well known features may not have been described in detail to avoid unnecessarily obscuring the invention.
Embodiments of the present invention enable a user to develop and refine a search query by constraining the semantic role of specific keywords. According to a particular class of embodiments, a search/browsing interface is provided which suggests additional keywords corresponding to documents in the collection being searched which also include the particular keyword in the semantic role to which it is being constrained. In this way, the user can readily identify and incorporate additional search terms into his query which relate to the topic of interest, even where he was not initially sure what he was looking for. The suggested keywords are identified using an underlying document model in which documents are represented with reference to important terms and their corresponding semantic roles.
As used herein, the term “document” refers to any electronically stored and searchable body of text including, for example, phrases, sentence fragments, sentences, collections of sentences, paragraphs, collections of paragraphs, abstracts, summaries, titles, metadata descriptions, entire documents, etc. Such bodies of text may be embodied in a variety of forms including, for example, plain text, web pages, word processing documents, PDF files, metadata associated with other documents or media, etc.
A particular embodiment of the invention will now be described in which the collection of documents being searched and/or browsed is the Yahoo! Answers collection. It should be noted, however, that this collection is being used herein for illustrative purposes only, and that the invention is not so limited. That is, embodiments of the present invention may be used to discover information in any of a wide variety of document collections that include documents having a wide variety of characteristics.
The Yahoo! Answers collection includes user-generated questions which are associated with corresponding answers also generated by Yahoo! users. A user of Yahoo! Answers may pose virtually any question relating to virtually any topic and have that question answered by one or more other Yahoo! users relatively quickly. The user may also search the collection or browse a hierarchy of categories in a conventional manner to determine whether someone has already posed the question, and whether any satisfactory answers were provided. However, given the large number of questions and answers in the collection, these conventional approaches may suffer from the limitations described above. Therefore, an embodiment of the present invention by which a user may more readily identify relevant information in such a collection will now be described with reference to the flowchart of
Each of the sentences in the collection is linguistically analyzed and tagged to identify the most important elements of the sentence (102). Semantically, the most important elements of a sentence are an “action” to which the sentence refers, and the “patient” and “agent” of the action. Therefore, according to a specific embodiment of the invention, each sentence in the collection is represented by at least one triplet comprising these elements, i.e., <agent, action, patient> also referred to herein as a “frame” (104). This document model may be understood with reference to the sentence “Last week, Corporation A sued Corporation B for patent infringement.” This sentence would be parsed and then represented with the triplet <Corporation A, to sue, Corporation B>.
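By way of illustration only, the frame described above might be held in a small record type. The following Python sketch is an assumption made for clarity and not a required implementation; the name Frame and its field names are hypothetical, and the frame is written by hand rather than produced by a parser.

```python
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    """One <agent, action, patient> triplet extracted from a sentence."""
    agent: Optional[str]    # who performs the action (may be absent)
    action: str             # the action to which the sentence refers
    patient: Optional[str]  # who or what the action is performed on


# Hand-written frame for the example sentence. A real system would obtain
# it from a semantic role labeler rather than from a literal.
sentence = "Last week, Corporation A sued Corporation B for patent infringement."
frame = Frame(agent="Corporation A", action="to sue", patient="Corporation B")

print(frame.agent, "|", frame.action, "|", frame.patient)
```

The optional agent and patient fields reflect the possibility that a sentence names only one entity in connection with an action.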
Other sentence components having various semantic roles may also be identified for particular sentences. Such components and semantic roles may include, for example, temporal qualifiers, spatial qualifiers, geographic modifiers, and modality qualifiers. Modern statistical parsers are able to detect these elements automatically in a sentence with reasonably high accuracy. An example of a statistical parser which may be employed with various embodiments of the invention is described in Combination Strategies for Semantic Role Labeling, Mihai Surdeanu, Lluis Marquez, Xavier Carreras, and Pere R. Comas, Journal of Artificial Intelligence Research 29 (2007), the entire disclosure of which is incorporated herein by reference for all purposes. And as documents are added to the collection, they may be analyzed, tagged, and represented in the same way (106). As will be understood, and as represented by the dashed line, the linguistic analysis of the document collection typically occurs at a different time (e.g., offline), and typically independently of the use of the resulting document model in facilitating information discovery as described below.
By determining the actions, patients, and agents of each of the documents in the collection, the techniques enabled by the present invention are then able to leverage this document model to suggest interesting search terms to the user which help the user to better understand the contents of the collection as well as strategies for suitably specifying and/or refining his query. As will be described, users select or specify not only search terms (108), but may also constrain the semantic roles of one or more of their search terms (110). New terms are then suggested to the user that “match” the semantic context in some way (112). For example if a user chooses an action such as “wash,” new terms such as “dishes” and “cars” may be presented as possible patients of the action. Alternatively, if the user chooses the patient “car,” actions such as “buying,” “washing,” “repairing,” etc., may be suggested. This feedback helps the user understand what types of documents can be found in the collection and to refine his query as needed. As will be described in greater detail below, the user's interactions iteratively constrain the semantic context to zero in on the information of interest (114).
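The suggestion step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the frames are invented sample data, the function name suggest is hypothetical, and ranking by raw frequency is only one possible ordering.

```python
from collections import Counter

# Hypothetical frames as (agent, action, patient) tuples.
FRAMES = [
    ("people", "wash", "dishes"),
    ("people", "wash", "cars"),
    ("kids", "wash", "dishes"),
    ("people", "buy", "cars"),
    ("mechanics", "repair", "cars"),
]
ROLES = ("agent", "action", "patient")


def suggest(frames, fixed_role, fixed_term, wanted_role):
    """Return the terms seen in wanted_role among the frames in which
    fixed_role holds fixed_term, most frequent first."""
    i, j = ROLES.index(fixed_role), ROLES.index(wanted_role)
    counts = Counter(f[j] for f in frames if f[i] == fixed_term)
    return [term for term, _ in counts.most_common()]


# Constraining the action to "wash" suggests possible patients.
patients = suggest(FRAMES, "action", "wash", "patient")
# Constraining the patient to "cars" suggests possible actions.
actions = suggest(FRAMES, "patient", "cars", "action")
```

Each user selection would narrow the set of frames considered, so repeated calls with additional constraints correspond to the iterative refinement described above.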
As will be understood, if a document is larger than a single sentence, e.g., a compound sentence, a collection of sentences, a paragraph, a lengthy document, etc., there may be multiple frames associated with the document, i.e., one for each triplet identified. In some cases, there may be multiple frames for a single sentence. For example, the sentence “I kick the ball and you stop it” includes two frames, <I, kick, ball> and <you, stop, ball>. According to some embodiments, the number of frames for a given document can be reduced by determining which frames are more relevant to the content focus of the document.
According to a particular implementation, once a collection is analyzed and tagged, the collection and the underlying document model may be conceptualized as three relational tables T, S, and R in which the rows of the tables are given by:
T: (documentId, document_text)—i.e., the text of a particular document (e.g., sentence).
S: (frameId, documentId)—i.e., the document in which a frame appears.
R: (term, frameId, role)—i.e., the frame and semantic role of a particular term.
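The three tables above can be sketched, for illustration only, with an in-memory SQLite database. The column types, key constraints, and lower-case role labels are assumptions not specified above; the sample rows encode the two frames of the sentence “I kick the ball and you stop it.”

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T (documentId INTEGER PRIMARY KEY, document_text TEXT);
    CREATE TABLE S (frameId INTEGER PRIMARY KEY,
                    documentId INTEGER REFERENCES T(documentId));
    CREATE TABLE R (term TEXT, frameId INTEGER REFERENCES S(frameId), role TEXT);
""")

conn.execute("INSERT INTO T VALUES (1, 'I kick the ball and you stop it')")
conn.executemany("INSERT INTO S VALUES (?, ?)", [(1, 1), (2, 1)])
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [
        ("I", 1, "agent"), ("kick", 1, "action"), ("ball", 1, "patient"),
        ("you", 2, "agent"), ("stop", 2, "action"), ("ball", 2, "patient"),
    ],
)

# Documents containing at least one frame in which "ball" is the patient.
rows = conn.execute("""
    SELECT DISTINCT T.documentId, T.document_text
    FROM R
    JOIN S ON R.frameId = S.frameId
    JOIN T ON S.documentId = T.documentId
    WHERE R.term = 'ball' AND R.role = 'patient'
""").fetchall()
```

Joining R to S and then to T maps a term and role back to the text of the documents in which they occur, which is the lookup the interface needs when displaying results.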
Various modes of interaction with search interfaces implemented according to the invention are contemplated. Two interrelated modes are described below with reference to the examples illustrated in screen shots of
A frame condition is a set of constraints on query terms and their roles that can be used to select a subset of the collection, e.g., sentences containing roles that satisfy the constraints, as well as to filter and/or reorder the terms presented in the columns of terms. For example, a frame condition specifying documents in which the term “wash” is an action and the term “shirt” is a patient may be expressed (Action=wash, Patient=shirt). The expression R[Action=wash] denotes the rows in R that contain both the term “wash” and the semantic role “action,” i.e., identifies any frameId for which this is true. User search and browsing actions (e.g., entering or clicking on a word) are translated into frame conditions which are then used to select corresponding documents and/or suggest additional search terms in accordance with specific embodiments of the invention.
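One possible way to evaluate a frame condition against R is sketched below. The function names, the dictionary representation of a condition, and the sample rows are hypothetical. The sketch also assumes that R[condition] is taken to include every row of each matching frame, so that the terms in the remaining roles stay available for suggestion.

```python
# Rows of R as (term, frameId, role); invented sample data.
R = [
    ("people", 1, "agent"), ("wash", 1, "action"), ("shirt", 1, "patient"),
    ("people", 2, "agent"), ("wash", 2, "action"), ("car", 2, "patient"),
    ("people", 3, "agent"), ("buy", 3, "action"), ("shirt", 3, "patient"),
]


def frames_matching(rows, condition):
    """Return the frameIds satisfying every role -> term constraint."""
    result = None
    for role, term in condition.items():
        ids = {fid for t, fid, r in rows if r == role and t == term}
        result = ids if result is None else result & ids
    # An empty condition places no restriction on the collection.
    return result if result is not None else {fid for _, fid, _ in rows}


def restrict(rows, condition):
    """R[condition]: all rows belonging to the matching frames."""
    keep = frames_matching(rows, condition)
    return [row for row in rows if row[1] in keep]


wash_frames = frames_matching(R, {"action": "wash"})
wash_shirt_frames = frames_matching(R, {"action": "wash", "patient": "shirt"})
```

Intersecting the per-constraint frame sets implements the logical AND of the constraints in a condition such as (Action=wash, Patient=shirt).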
Term filtering ranking functions measure the interest of a term being used as a new frame condition. For example, for R[Verb=wash] the term interest for (Patient,car) may be 0.1, and for (Patient, sock) 1.0, indicating that sock is likely a more interesting choice for the user than car. An example of a term interest function is the number of frames in which (Role,term) appears in R[frame condition].
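The example term interest function mentioned above, a count of frames, can be sketched as follows. The sample rows and the function name are hypothetical, the role label "action" is used here for what the example above calls Verb, and the raw counts are not normalized to the 0.1 to 1.0 scale used in that example.

```python
from collections import Counter

# Rows of R as (term, frameId, role); invented sample data.
R = [
    ("wash", 1, "action"), ("sock", 1, "patient"),
    ("wash", 2, "action"), ("sock", 2, "patient"),
    ("wash", 3, "action"), ("car", 3, "patient"),
    ("buy", 4, "action"), ("car", 4, "patient"),
]


def term_interest(rows, role, term):
    """For the single-constraint condition role=term, count the frames in
    which each other (role, term) pair appears."""
    matching = {fid for t, fid, r in rows if r == role and t == term}
    # A set removes duplicates so each pair counts once per frame.
    seen = {
        (r, t, fid)
        for t, fid, r in rows
        if fid in matching and (r, t) != (role, term)
    }
    return Counter((r, t) for r, t, _ in seen)


interest = term_interest(R, "action", "wash")
```

With this sample data, sock scores higher than car under the wash condition, which matches the ordering in the example above.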
Term relatedness ranking functions measure how semantically related two terms are. According to one approach, semantically related terms may be identified by identifying different terms in a particular semantic role which have the same terms in one or both of the other semantic roles in a triplet, e.g., another action with the same agent and patient. For example, for R[Verb=wash] the term relatedness for (Verb,clean) may be 1.0, and for (Verb,buy) may be 0.1, indicating that clean is likely more closely related to wash than buy is. An example of a term relatedness function for two verbs is the number of patient roles that are shared in R[frame condition].
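The shared-patient relatedness measure for two verbs can be sketched as follows. The sample rows and function names are hypothetical, and for brevity the count is taken over the whole of R; a caller could equally pass only the rows of R[frame condition].

```python
# Rows of R as (term, frameId, role); invented sample data.
R = [
    ("wash", 1, "action"), ("car", 1, "patient"),
    ("wash", 2, "action"), ("shirt", 2, "patient"),
    ("clean", 3, "action"), ("car", 3, "patient"),
    ("clean", 4, "action"), ("shirt", 4, "patient"),
    ("buy", 5, "action"), ("car", 5, "patient"),
]


def patients_of(rows, action):
    """Return the set of patient terms occurring in frames with this action."""
    frames = {fid for t, fid, r in rows if r == "action" and t == action}
    return {t for t, fid, r in rows if r == "patient" and fid in frames}


def relatedness(rows, action_a, action_b):
    """Number of distinct patients the two actions have in common."""
    return len(patients_of(rows, action_a) & patients_of(rows, action_b))


wash_clean = relatedness(R, "wash", "clean")
wash_buy = relatedness(R, "wash", "buy")
```

In this sample, wash and clean share more patients than wash and buy do, so clean would be ranked as the more closely related action.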
As shown in the examples of
When a user selects a term in one of the browsing columns, e.g., “people” in the agent column of
It should be noted that the browsing columns and the order of the terms in each can be predetermined, or generated on the fly as the user makes selections. Columns can also be merged. For example, the browsing columns for Patient and Agent could be merged given that there are sentence structures which include only one entity associated with an action.
According to some implementations, traditional search functionalities can be integrated with user interfaces implemented according to the invention. That is, as discussed above, when a user types keywords in the search text box, the keywords may be translated into frame conditions, e.g., with the keyword being the term and the role being empty (i.e., it is not yet clear what semantic role the user intends). For example, if the user enters “dog” in the search box, this may be represented as the frame condition (ANY_ROLE, dog) and logically ANDed with any other conditions. In this way R[frame condition] can be updated naturally in response to this mode of interaction with the interface.
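A keyword with an unspecified role can be sketched as a condition that matches a term in any role. In the following illustration the use of None as the ANY_ROLE sentinel, the function names, and the sample rows are all assumptions.

```python
ANY_ROLE = None  # hypothetical sentinel meaning "role not yet constrained"

# Rows of R as (term, frameId, role); invented sample data.
R = [
    ("dog", 1, "agent"), ("bite", 1, "action"), ("man", 1, "patient"),
    ("man", 2, "agent"), ("wash", 2, "action"), ("dog", 2, "patient"),
    ("man", 3, "agent"), ("wash", 3, "action"), ("car", 3, "patient"),
]


def frames_for(rows, role, term):
    """Return frameIds containing term, in the given role or in any role."""
    return {
        fid for t, fid, r in rows
        if t == term and (role is ANY_ROLE or r == role)
    }


def frames_matching_all(rows, conditions):
    """Logically AND a list of (role, term) conditions."""
    sets = [frames_for(rows, role, term) for role, term in conditions]
    return set.intersection(*sets) if sets else {fid for _, fid, _ in rows}


typed_only = frames_matching_all(R, [(ANY_ROLE, "dog")])
typed_and_clicked = frames_matching_all(R, [(ANY_ROLE, "dog"), ("action", "wash")])
```

Treating typed keywords and column selections as the same kind of condition lets both modes of interaction update a single set of matching frames.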
According to specific embodiments, various types of modifiers (e.g., temporal, modality, spatial, geographic, etc.) may be surfaced if they have some statistical significance to the current context. For example, many sentences might share a common action but have very different temporal modifiers, e.g., today, this week, next week, next year, last month, etc. In such cases it may be useful to provide these options to the user for refining of her query.
The concept of tags (e.g., people, places, etc.) may also be employed with particular implementations to leverage the underlying document model. For example, entering the keyword “Pablo Picasso” will identify all documents in the collection in which this keyword appears as either an agent or a patient. The user may then be given the option of selecting a tag or category, e.g., people or places associated with Pablo Picasso. Because each of the documents has been parsed to derive representative triplets, and if each of the terms in the triplet also has associated tags, it is possible then to identify from the triplets all people or places that are identified with Pablo Picasso in the document collection.
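The tag-based lookup described above can be sketched as follows, assuming that each term in a triplet carries at most one tag. The frames, the tag assignments, and the function name are invented purely for illustration and are not drawn from any actual collection.

```python
# Hypothetical frames as (agent, action, patient) tuples.
FRAMES = [
    ("Pablo Picasso", "paint", "Guernica"),
    ("Pablo Picasso", "meet", "Georges Braque"),
    ("Pablo Picasso", "visit", "Paris"),
    ("Gertrude Stein", "admire", "Pablo Picasso"),
]

# Hypothetical tag assigned to each term; untagged terms are simply absent.
TAGS = {
    "Pablo Picasso": "person",
    "Georges Braque": "person",
    "Gertrude Stein": "person",
    "Paris": "place",
}


def related_by_tag(frames, tags, keyword, tag):
    """Return terms carrying the given tag that share a frame with keyword,
    where keyword appears as either agent or patient."""
    found = set()
    for agent, _action, patient in frames:
        if keyword in (agent, patient):
            for term in (agent, patient):
                if term != keyword and tags.get(term) == tag:
                    found.add(term)
    return found


people = related_by_tag(FRAMES, TAGS, "Pablo Picasso", "person")
places = related_by_tag(FRAMES, TAGS, "Pablo Picasso", "place")
```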
Some examples of the foregoing functionalities may be illustrative. If the user enters the search term “husband” in the search text box as shown in
In another example shown in
By contrast, when the user selects “baby” to be an agent as shown in
In the example illustrated in
In the example illustrated in
Embodiments of the present invention may be employed to provide search and browsing services for document collections in any of a wide variety of computing contexts and using any of a wide variety of technologies. For example, as illustrated in
The invention may also be practiced in a wide variety of network environments (represented by network 1212) including, for example, TCP/IP-based networks, telecommunications networks, wireless networks, etc. In addition, the computer program instructions with which embodiments of the invention are implemented may be stored in any type of tangible computer-readable media, and may be executed according to a variety of computing models including a client/server model, a peer-to-peer model, on a stand-alone computing device, or according to a distributed computing model in which various of the functionalities described herein may be effected or employed at different locations.
While the invention has been particularly shown and described with reference to specific embodiments thereof, it will be understood by those skilled in the art that changes in the form and details of the disclosed embodiments may be made without departing from the spirit or scope of the invention. In addition, although various advantages, aspects, and objects of the present invention have been discussed herein with reference to various embodiments, it will be understood that the scope of the invention should not be limited by reference to such advantages, aspects, and objects. Rather, the scope of the invention should be determined with reference to the appended claims.