Within the field of computing, many scenarios involve a content set comprising one or more content items, such as a set of files in a filesystem, a set of email messages in an email mailbox, and a set of contact records in an address book. Such content items may be identified through many identifiers, such as a name, a location within the content set, a user indicated as an owner or creator of the content item, or one or more topics addressed by the contents of the content item.
Within such content sets, a user may wish to search for a particular content item. A user may therefore provide a query comprising one or more keywords, such as a portion of a filename of a file representing the content item or one or more words that appear in an email message. In order to evaluate such queries, a search algorithm may therefore index respective content items of one or more content item sets according to various keywords associated with the content item, e.g., according to the filenames of files in a filesystem or words appearing in the subject or body of email messages in an email mailbox. A search algorithm may therefore apply the query to the content item sets, e.g., by using the search index to identify content items having the keywords in the filename or in the contents of the message, and may present to the user a set of candidate content items matching the query. The search algorithm may therefore apply the query in an efficient manner and may rapidly return results to the user.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key factors or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
While the evaluation of a query comprising a set of keywords through the use of a search index that indexes the content items may be efficient, the results returned by such search algorithms may be inadequately selective or helpful. As a first example, it may be difficult to use these techniques to select for a keyword that appears often in the content items. In one such scenario, a user may wish to search for a contact record for an individual with the last name of Plant, but if the user is interested in gardening, a large number of content items may incidentally include the term “plant” and may appear in the search results, thereby obscuring the search result relating to the contact record that is sought by the user. As a second example, it may be difficult to apply some queries to the content items indexed in the search index, such as queries for short words (e.g., a search for a contact record of an individual with the last name Su may turn up a large number of content items featuring the letter combination “Su”) and queries based on the initials of an individual (e.g., a search for users with the initials “C C” may produce a result set featuring a name including the letter “C”).
However, it may be possible to interpret the query based on the implied and inferred intent of the user in formulating the query. Thus, rather than simply applying a rote matching of the terms of a query with any identifiers of the entire content item, content items may be indexed based on the likelihood of a user searching for a particular content item based on a particular field. As a first example, it may be appreciated that a user is more likely to search for a content item based on some identifiers (e.g., metadata fields associated with a user name, a filename, or an email message title) than other identifiers (e.g., a small segment of text in a long document). As a second example, a search using the initials “C C” may be inferred as searching for individuals having a name with these initials, or for documents or other files containing a series of words beginning with these letters (such as “carrot cake”). Accordingly, techniques may be devised to index content items according to the manner whereby a user may choose to search for the content item, and to apply a query while searching for content items based on the inferred intent of the user while formulating the query. Such techniques may therefore present the search results, and may order the search results, in a manner that is of higher relevance to the user based on the inferred intent of the query.
Presented herein are techniques for evaluating a query against a content set, comprising various content items (such as locally stored objects of various types, e.g., files in a filesystem, email messages in an email mailbox, and contact records in an address book), that may more robustly evaluate the query and may present more selective search results that may be more highly tailored to the intended meaning of the query. In accordance with these techniques, content items may be indexed in a content index according to various identifiers (e.g., a filename or portion of a filename of a file; the sender email address, recipient email address, and subject keywords of an email message; and a first name, last name, nickname, full name, and email address of a contact record in an address book), but each identifier may be associated with an identifier weight that indicates the likelihood of a user searching for the content item by using the identifier. When a user enters a query, the tokens of the query may be matched with different identifiers associated with different content items, and the candidate content items (those indexed with identifiers matching the tokens of the query) may be sorted according to the weights of the associated identifiers. Moreover, if the query is entered in a particular search context (e.g., a query entered into an email client), it may be inferred that the user may be devising the query in the search context, and may be choosing the terms of the query based on identifiers associated with the search context. Therefore, the identifiers that are associated with the search context (e.g., a Sender field or a Subject field that is more heavily associated with email messages) may be weighted more heavily in computing the rank scores, increasing the likelihood that the retrieved content items may be more relevant to the user due to the search context wherein the user entered the query.
For example, a user entering the query “Su” may match a contact having a last name of “Su”, a second contact having the first name “Susan”, a file named “Grocery List” including the term “sugar”, and an email message including the word “surgery” in the subject. Some search algorithms may present all of these content items as search results, possibly sorted by an arbitrary criterion (e.g., alphabetically or by date of creation). However, in accordance with the techniques presented herein, the indicators whereby each content item is indexed are associated with weights indicating the likelihood that a user entering the query “Su” intended to locate the content item. Therefore, the contact with the last name “Su” (which exactly matches the query) may be presented as a first search result, indicating a high predicted likelihood that the user is searching for this content item (in view of the exact match with a frequently searched property of the content item); the contact with the first name “Susan” and the email message including the term “surgery” may be presented as second and third search results, indicating a medium predicted likelihood that the user is searching for these content items (in view of a partial match with infrequently searched properties of these content items); and the file named “Grocery List” and including the term “sugar” may be presented as the last search result, indicating a low predicted likelihood that the user is searching for this content item (in view of the match with an infrequently searched property of the content item). The search results are therefore presented in a more selective manner, based on the predicted intent of the user in providing “Su” as a token of the query.
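The ordering described for the query “Su” can be illustrated with a minimal Python sketch. The weight values, the exact-match bonus, and the names `Entry`, `INDEX`, and `rank` are hypothetical choices made for illustration only; the description does not prescribe particular numbers or data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    item: str        # label of the content item
    identifier: str  # identifier under which the item is indexed
    weight: float    # hypothetical identifier weight (higher = searched more often)


# Illustrative index for the four content items of the "Su" example.
INDEX = [
    Entry("contact: Su", "su", 0.9),            # last name of a contact
    Entry("contact: Susan", "susan", 0.6),      # first name of a contact
    Entry("email: surgery", "surgery", 0.5),    # word in an email subject
    Entry("file: Grocery List", "sugar", 0.2),  # word inside a file
]


def rank(token):
    """Return matching items, highest predicted relevance first."""
    token = token.lower()
    scored = []
    for entry in INDEX:
        if entry.identifier.startswith(token):
            # Assumed bonus: an exact match outranks a partial (prefix) match.
            bonus = 0.5 if entry.identifier == token else 0.0
            scored.append((entry.weight + bonus, entry.item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored]
```

Under these assumed weights, `rank("Su")` lists the contact with the last name “Su” first and the “Grocery List” file last, mirroring the ordering described above.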
As further provided herein, additional techniques may be applied that may further improve the selectivity of the search algorithm in identifying the predicted intent of the user while formulating the query. For example, the search context may be considered while evaluating the predicted relevance of various indicators. For example, if the query “Su” is entered in the context of a search for an individual (e.g., a search initiated in relation to the “To:” field of an email message, or within an address book application), it may be inferred that content items matching the query on a name-related field are likely to be of higher predicted relevance (e.g., further weighing the contacts with the last name “Su” and first name “Susan” over other content items). However, if the user initiates the query in the context of a communication content search (e.g., in the context of a search on a message body), the email message including the term “surgery” may be more highly weighted; and if the user initiates the query in the context of a file content search, the “Grocery List” file containing the word “sugar” may be more highly weighted. Thus, the context of the search may be utilized to adjust the weights of the identifiers matching the query, in order to improve the predicted relevance to the user of the selection and ranking of search results.
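One possible way to apply a search context is to multiply the base identifier weights by context-specific factors. The following Python sketch assumes hypothetical identifier types (`name`, `subject`, `file_content`), context labels, and multiplier values; none of these are specified by the description.

```python
# Hypothetical base weights per identifier type.
WEIGHTS = {"name": 0.9, "subject": 0.5, "file_content": 0.2}

# Hypothetical multipliers applied when a query is entered in a given context.
CONTEXT_BOOST = {
    "address_book": {"name": 2.0},
    "message_search": {"subject": 3.0},
    "file_search": {"file_content": 6.0},
}

# Items already matched by the token "Su", with the identifier type that matched.
MATCHES = [
    ("contact: Su", "name"),
    ("contact: Susan", "name"),
    ("email: surgery", "subject"),
    ("file: Grocery List", "file_content"),
]


def rank_in_context(matches, context):
    """Order matched items by base weight scaled by the context multiplier."""
    boosts = CONTEXT_BOOST.get(context, {})

    def score(match):
        _, identifier_type = match
        return WEIGHTS[identifier_type] * boosts.get(identifier_type, 1.0)

    # sorted() is stable, so equally scored items keep their original order.
    return [item for item, _ in sorted(matches, key=score, reverse=True)]
```

With these illustrative values, the same matches are led by the contacts in an address book context, by the email message in a message search context, and by the file in a file search context.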
As another (alternative or additional) technique, the weights of the search terms may be adjusted based on the correspondence of the sequential order of the tokens of the query with the sequential order of matching portions of the identifier (e.g., for a query comprising the tokens “jo st”, preferentially presenting the search result “Joe Stone” over the search result “Steve Jones”); based on the matching of a token with multiple indicators (e.g., for a query comprising the token “an”, preferentially presenting the search result “Ann Anderson” over the search result “Ann Smith”); and based on the complete matching of a token with an identifier (e.g., for a query comprising the token “Michael”, preferentially presenting the search result “Joe Michael” over the search result “Steve Michaelson”). Such heuristics may promote the presentation of search results in an order that is more likely to conform to the intended meaning of the query formulated by the user than an arbitrary sorting of search results (e.g., by alphabetic order or by date of creation). Additionally, such heuristics may be comparatively simple, such that the adjustment may be made in realtime without significantly prolonging the evaluation of the query or delaying the presentation of search results in response thereto.
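The three heuristics named above (token order, multiple matches, and complete matches) can be sketched as a single scoring function. This is a simplified illustration: the point values and the function name `heuristic_score` are assumptions, and prefix matching of tokens against whitespace-separated words is one of several plausible matching rules.

```python
def heuristic_score(query, name):
    """Score a name against a query using three simple, illustrative heuristics."""
    tokens = query.lower().split()
    words = name.lower().split()
    score = 0.0
    first_positions = []
    for token in tokens:
        hits = [i for i, word in enumerate(words) if word.startswith(token)]
        if not hits:
            return 0.0  # every token must match at least one word
        first_positions.append(hits[0])
        # Heuristic: a token matching several words scores once per match.
        score += len(hits)
        # Heuristic: a token that completely matches a word earns a bonus.
        if any(words[i] == token for i in hits):
            score += 1.0
    # Heuristic: tokens appearing in the same order as the matched words earn a bonus.
    if len(tokens) > 1 and first_positions == sorted(first_positions):
        score += 1.0
    return score
```

Because each heuristic is a constant-time or linear-time check over a short string, a function of this kind may be evaluated for every candidate without noticeably delaying the presentation of results.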
To the accomplishment of the foregoing and related ends, the following description and annexed drawings set forth certain illustrative aspects and implementations. These are indicative of but a few of the various ways in which one or more aspects may be employed. Other aspects, advantages, and novel features of the disclosure will become apparent from the following detailed description when considered in conjunction with the annexed drawings.
The claimed subject matter is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the claimed subject matter. It may be evident, however, that the claimed subject matter may be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to facilitate describing the claimed subject matter.
Within the field of computing, many scenarios involve a content set comprising various content items, such as a filesystem comprising one or more files, an email system comprising one or more email messages, and an address book featuring one or more contact records. These content sets may be stored locally (e.g., on a memory of a device operated by a user), remotely over a local area network (e.g., on a network file server), or remotely over a wide area network (e.g., on various servers connected to the Internet). Each of these content sets may store the content items in a particular manner (e.g., the filesystem may store files in a hierarchical manner; the email mailbox may store email messages in one or more folders; and the address book may store all contact records together as an unorganized set). The items of each content set may also be structured in various ways, featuring various types of metadata that semantically identify the content item (e.g., files in a filesystem may have a name, a location within the hierarchy of the filesystem, a creation date, and a file type; email messages in an email mailbox may have a sender email address, a subject, and a date of delivery; and contact records in an address book may have a full name, a mailing address, and a profile picture). These various properties may serve as identifiers, whereby a user may distinctively identify and reference a particular content item.
Within such scenarios, a user may wish to search for one or more content items that meet particular criteria. For example, a user may wish to search for content items associated with the name of a colleague, such as files created, owned by, or referencing the colleague, email messages exchanged with or discussing the colleague, and one or more contact records for the colleague. Therefore, a user may submit a query, comprising one or more keywords that may be related to the identifiers of the content items that the user seeks. A device operated by the user that has access to the content items may therefore apply the query in various ways to the content items of the content sets, and may generate a result set comprising the candidate content items that have been identified as matching the query provided by the user. For example, upon receiving a particular query comprising a set of keywords from a user, the device of the user may examine all available content sets for content items matching all of the keywords, and may present the matching candidate content items to the user in response to the query.
In many such scenarios, the number of content items 22 stored in the content sets 20 against which a user 12 may submit a query 14 may be large. Performing a thorough ad hoc search of each content item 22 in a content set 20 may therefore be very time-consuming, resulting in a significant delay in providing the result set of candidate content items to the user 12 in response to the query 14. Therefore, many devices 18 and content sets 20 are configured to generate, maintain, and utilize a search index, representing an index of the identifiers of each content item 22 in a rapidly searchable data structure (e.g., a hashtable). When the device 18 receives a new content item 22 or an update to a content item 22, the device 18 may examine the content item 22 for identifiers associated with the content item 22 that might subsequently be entered as keywords 16 in a query 14, and may index the content item 22 in the search index according to the identifiers. When the device 18 later receives a query 14 from a user 12, the device 18 may refer to the index to identify the content items 22 associated with each keyword 16 of the query 14, and may rapidly identify and present to the user 12 the candidate content items for the query 14.
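A minimal sketch of such a search index, using a Python dictionary as the hashtable, is shown below. The class name `SearchIndex` and its methods are hypothetical; the sketch indexes whole lowercase words only and treats a query as requiring every keyword to match.

```python
from collections import defaultdict


class SearchIndex:
    """Maps each keyword to the set of content items indexed under it."""

    def __init__(self):
        self._postings = defaultdict(set)  # keyword -> item ids
        self._keywords_by_item = {}        # item id -> keywords (for updates)

    def add(self, item_id, identifiers):
        """Index a new content item, or re-index an updated one."""
        self.remove(item_id)
        keywords = {word.lower() for text in identifiers for word in text.split()}
        for keyword in keywords:
            self._postings[keyword].add(item_id)
        self._keywords_by_item[item_id] = keywords

    def remove(self, item_id):
        """Drop all index entries for an item, if it was indexed."""
        for keyword in self._keywords_by_item.pop(item_id, set()):
            self._postings[keyword].discard(item_id)

    def search(self, query):
        """Return the ids of items indexed under every keyword of the query."""
        keywords = query.lower().split()
        if not keywords:
            return set()
        result = None
        for keyword in keywords:
            ids = self._postings.get(keyword, set())
            result = set(ids) if result is None else result & ids
        return result
```

Each lookup touches only the posting sets of the query keywords, rather than every content item, which is the source of the speed advantage over an ad hoc scan.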
However, while many search algorithms 32 may correctly identify content items 22 matching the keywords 16 of the query 14, the search results 36 may nevertheless be unsatisfying or unhelpful to the user 12. As a first example, if many content items 22 match the query 14, the search results 36 may be voluminous, and it may be difficult for the user 12 to identify the content items 22 of interest from the candidate content items 38 of the search results 36. As a second example, many content items 22 may incidentally match a particular keyword 16 in ways that the user 12 may not have intended. For example, the user 12 may wish to search for an individual having the last name of “Plant,” and may therefore submit a query 14 including the keyword “plant”. However, if the user 12 is employed as a gardener, many content items 22 (e.g., files and email messages) in the computing environment of the user 12 may include the keyword “plant” and may therefore be identified as candidate content items 38, even if this is not the intended meaning of the term to the user 12. As a third example, the device may be incapable of applying some keywords 16 to the content items 22 of the content sets 20, even with the use of a search index 34. For example, the search index 34 may index content items 22 according to identifiers having a minimum length, e.g., of three alphanumeric characters, because shorter identifiers may match a large number of content items 22. The user 12 may therefore be unable to submit a query 14 for an individual having the last name “Su,” as this keyword 16 may be too short to be evaluated by the search index 34. As a fourth example, the device may not be configured to evaluate particular types of queries, such as queries for individuals having the initials “C C”.
In these and other scenarios, the user 12 may be unable to submit a desired query 14, and/or may have difficulty identifying the content items 22 of interest among a large set of candidate content items 38.
It may be appreciated that a significant cause of the inefficiency of comparatively simple techniques for applying a query 14 to one or more content sets 20 relates to the incapability of the evaluation of the relevance of the matched identifiers in a content item 22 to the keywords 16 of a query 14. For example, in the exemplary scenario 30 of
In accordance with this observation, the techniques presented herein are devised to perform an evaluation of a query 14 against content items 22 of various content sets 20 in a manner that also assesses a predicted relevance of the matching of the query 14 to the content items 22. These techniques may be devised to regard the elements of a query 14 not as criteria to be compared with content items 22 in a rote manner, such that every content item 22 matching all criteria in at least a minimal capacity is identified and presented as an equally valid search result. Rather, the elements of the query 14 may be regarded as adjectives or “hints” describing the content item(s) 22 that the user 12 wishes to locate. For example, a user may wish to identify content items 22 stored in a computer system relating to a device having particular properties, such as a mobile phone manufactured by a company called “Mobility” and having a 50-centimeter display, a keypad, and of the color black. The user may therefore generate a query 14 comprising the terms “mobility 50 keypad black”. A less sophisticated search algorithm might simply identify every candidate content item 38 matching all four of these tokens in some capacity, and may present the results in an unsorted or arbitrarily sorted manner. However, an embodiment formulated according to the techniques presented herein may endeavor to apply the query according to the implied intent of each element of the query. For example, the number “50” may match at least one aspect of a very large number of candidate items 22, but such matches may have different significance. For example, it may be more likely that the user 12 intended to retrieve a content item 22 describing a phone with a 50-centimeter display or an individual living at 50 Main Street than a document having a file size of 50 kilobytes or a file created 50 days ago.
While the latter results may be valid, the former results may have a higher probability of relevance to the intent of the query 14. Accordingly, an embodiment of these techniques may index different content items 22 based not only on a set of identifiers 42, but on different identifier weights 44 of various identifiers 42, indicating the probability that a user 12 searching for the content item 22 may choose to describe or search for it according to that identifier 42. This information may be used to select candidate content items 38 of higher predicted relevance to the user 12, and to adjust the presentation of candidate content items 38 accordingly (e.g., by sorting the candidate content items 38 according to a rank score that is indicative of the identifier weights 44 of the identifiers 42 matching the elements of the query 14).
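As a rough illustration of a rank score derived from identifier weights, the following Python sketch sums the weights of the identifiers matched by each token. The index contents, the weight values, and the summation rule are assumptions made for this example; an embodiment may compute the rank score differently.

```python
from collections import defaultdict

# Hypothetical content index: identifier -> [(content item, identifier weight)].
CONTENT_INDEX = {
    "mobility": [("phone spec", 0.9)],
    "50": [
        ("phone spec", 0.8),                 # 50-centimeter display
        ("contact at 50 Main Street", 0.7),  # street address
        ("report.doc (50 KB)", 0.1),         # file size, rarely searched for
    ],
    "keypad": [("phone spec", 0.6)],
    "black": [("phone spec", 0.5), ("report.doc (50 KB)", 0.3)],
}


def ranked_candidates(query):
    """Return items matching every token, ordered by summed identifier weight."""
    tokens = set(query.lower().split())
    scores = defaultdict(float)
    matched = defaultdict(set)
    for token in tokens:
        for item, weight in CONTENT_INDEX.get(token, []):
            scores[item] += weight
            matched[item].add(token)
    candidates = [(scores[item], item) for item in scores if matched[item] == tokens]
    return [item for _, item in sorted(candidates, reverse=True)]
```

In this sketch, a query of “50” returns all three items but places the file whose size happens to be 50 kilobytes last, reflecting the low weight assigned to that identifier.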
As one example of the techniques presented herein, among the content items 22 in the exemplary scenario 10 of
In some embodiments, additional techniques may be applied to the calculated rank scores 56 in order to enhance the predictions of relevance. In addition to calculating a rank score 56 based on the identifier weights 44 of the identifiers 42 matching the tokens 52 of the query 14, an embodiment may adjust the rank scores 56 based on various properties of the matching. For example, the rank score 56 for a candidate content item 38 may be increased if the identifiers 42 matching respective tokens 52 are sequentially close together; if the same identifier 42 matches several tokens 52; or if a token 52 matches a large part or all of an identifier 42 (e.g., a higher rank score 56 may be attributed to a match of tokens 52 “joe” and “smith” in an exemplary query 14 with the identifier 42 “Joe Smithy” than “Joe Smithkowski,” in view of the greater percentage of the former identifier 42 matched by the tokens 52). Various adjustment techniques, some of which are presented herein, or combinations thereof may be applied to adjust the rank scores 56 of various candidate content items 38 in order to improve the relevance predictions of the candidate content items 38 with the query 14.
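The “percentage of the identifier matched” adjustment can be sketched as a coverage ratio. The function names, the prefix-matching rule, and the `max_boost` parameter below are hypothetical and serve only to make the “Joe Smithy” versus “Joe Smithkowski” comparison concrete.

```python
def coverage(tokens, identifier):
    """Fraction of the identifier's characters accounted for by matching tokens."""
    words = identifier.lower().split()
    total = sum(len(word) for word in words)
    if total == 0:
        return 0.0
    covered = 0
    for token in tokens:
        token = token.lower()
        if any(word.startswith(token) for word in words):
            covered += len(token)
    return min(1.0, covered / total)


def adjusted_rank_score(base_score, tokens, identifier, max_boost=0.5):
    """Scale a base rank score up by at most max_boost, in proportion to coverage."""
    return base_score * (1.0 + max_boost * coverage(tokens, identifier))
```

For the tokens “joe” and “smith”, the coverage of “Joe Smithy” is 8 of 9 characters, while the coverage of “Joe Smithkowski” is 8 of 14 characters, so the former receives the larger adjustment.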
Still another embodiment involves a computer-readable medium comprising processor-executable instructions configured to apply the techniques presented herein. An exemplary computer-readable medium that may be devised in these ways is illustrated in
The techniques discussed herein may be devised with variations in many aspects, and some variations may present additional advantages and/or reduce disadvantages with respect to other variations of these and other techniques. Moreover, some variations may be implemented in combination, and some combinations may feature additional advantages and/or reduced disadvantages through synergistic cooperation. The variations may be incorporated in various embodiments (e.g., the exemplary method 60 of
A first aspect that may vary among embodiments of these techniques relates to the scenarios wherein such techniques may be utilized. As a first example, these techniques may be applied to many types of devices 18, including workstations, servers, portable computers such as notebooks, and small devices such as smartphones. As a second example of this first aspect, many types of content sets 20 and content items 22 may be indexed and searched in this manner, including many types of user or system data objects, such as files in a filesystem, email messages in an email mailbox, contacts in a contacts database, objects in an object system, database records in a database, images in an image set, and financial entries in an accounting system. As a third example of this first aspect, many types of queries 14 comprising various types of tokens 52 may be received, such as textual tokens, integer or floating-point tokens, queries structured in a logical manner (e.g., with Boolean connectors), and voice queries comprising tokens 52 translated from spoken phonemes. As a fourth example of this first aspect, the content items 22 may be accessible to the device 18 implementing these techniques in many ways, such as a locally stored content set 20 comprising content items 22 stored in a memory component of the device 18, a network-accessible content set 20 comprising content items 22 accessible over a local area network, or a remote content set 20 comprising content items 22 accessible over a wide area network, such as the Internet.
A particular scenario where the techniques presented herein may be particularly useful involves a content set 20 comprising content items 22 of a content item type. For example, a device 18 may store a set of applications, each of which may manage a custom content set 20 comprising a set of content items 22 of a custom content item type. An embodiment of these techniques (e.g., the exemplary system 86 in the exemplary scenario 80 of
Additionally, these techniques may be particularly useful in some scenarios due to the rapid evaluation of a query 14 against a set of content items 22. As one example, these techniques may be applied in the context of suggestions of query results while a user 12 continues to enter the query 14. For example, when the user 12 begins entering a first query 14, a first set of candidate content items 38 corresponding to the first query 14 may be identified and presented to the user 12. However, the user 12 may continue to enter the query 14 (e.g., adding new tokens, removing tokens that are skewing the search results, or modifying or reordering existing tokens). Accordingly, a second query 14 may be identified, and the search results may be altered (e.g., by removing candidate content items 38 that do not match second query tokens that have been added to the second query 14; by adding candidate content items 38 that did not match the first query 14 but that match the second query 14 due to the removal of one or more first query tokens) and/or reordered (e.g., by re-ranking the candidate content items 38 based on the tokens of the second query 14). A second set of search results may therefore be presented to the user 12 based on the second query 14.
This variation may allow the user to view the adjustments to the search results in near-realtime while entering the query 14; may allow the user 12 to determine how to modify the query 14 to identify intended search results (e.g., by removing query terms that are matching too many unrelated candidate content items 38); and may allow the user 12 to stop entering additional search terms when the query 14 is sufficiently focused or has identified the candidate content item 38 that the user 12 is seeking. For example, a user 12 may enter a first search query comprising a particular set of tokens (e.g., “blue 1957”), and may quickly be presented with a broad list of candidate content items 38. The user 12 may then continue entering tokens 52 comprising additional “hints” for the query 14, such as “blue 1957 car,” thereby narrowing the set of candidate content items 38 to those describing blue automobiles involved with the year 1957, and removing candidate content items 38 not relating to automobiles. The user 12 may then add another hint, such as “blue 1957 car v8,” which may automatically adjust the search results to present a null set of search results (e.g., if the user 12 is misremembering that the car in question had a v8 engine). The user 12 may then replace the latter token 52 with the new token 52 “v6”, and the embodiment may display a small set of search results satisfying these tokens 52, which may include a candidate content item 38 that the user 12 sought. This adjusting of the candidate content items 38 in response to the inputting of the query 14 may allow the user 12 to tailor the query 14 to the desired intent of the user 12 by rapidly displaying the consequences of adding, removing, or altering various “hints” as to the candidate content items 38 matching the query 14. Those of ordinary skill in the art may devise many scenarios wherein the techniques presented herein may be utilized.
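The “blue 1957” refinement sequence can be simulated with a short Python sketch that re-evaluates the query after each edit. The item names and keyword sets are invented for illustration, and exact keyword matching is used in place of the weighted ranking for brevity.

```python
# Hypothetical content items and the keywords under which each is indexed.
ITEMS = {
    "photo_0412.jpg": {"blue", "1957", "car", "v6"},
    "invoice_1957.pdf": {"blue", "1957", "paint"},
    "notes.txt": {"blue", "sky"},
}


def search(query):
    """Return the names of items indexed under every token of the query."""
    tokens = set(query.lower().split())
    return sorted(name for name, keywords in ITEMS.items() if tokens <= keywords)


def results_while_typing(query_snapshots):
    """Re-run the search for each successive state of the query being entered."""
    return [search(query) for query in query_snapshots]
```

Calling `results_while_typing(["blue 1957", "blue 1957 car", "blue 1957 car v8", "blue 1957 car v6"])` yields a broad list, then a narrowed list, then an empty list for the misremembered “v8”, and finally the single item matching “v6”.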
A second aspect that may vary among embodiments of these techniques relates to the manner of indexing content items 22 according to various identifiers 42. As a first example, many pieces of data that identify the content item 22 may be utilized as identifiers 42, such as a name or title of the content item 22, a location of the content item 22 within a content set 20, a creation date, the name of a user 12 comprising an owner or creator of the content item 22, a content item type, various properties of the contents of the content item 22 (e.g., a summary or set of frequently appearing keywords in a document, or a textual description of an image), various pieces of metadata associated with the content item 22, or other content items 22 to which the content item 22 is related. Additionally, it may be desirable to index respective content items 22 according to all identifiers 42 associated therewith (and assigning at least minimal weight to each identifier 42). Conversely, an application may be selective about the identifiers 42 used to index a content item 22 in the content index 46. For example, in indexing an email message, an application may lexically identify the keywords of the title and body of the message that significantly pertain to the content of the message (such that a user 12 may search for the email message according to such keywords), but may refrain from indexing the message according to other keywords that are only tangentially related to the message (such that a user 12 is unlikely to search for the message according to the keywords). As a second example of this second aspect, the identifiers 42 may be indexed in many ways within the content index 46. For example, the identifiers 42 may be natively stored in the content index 46, may be converted to a standard data type (e.g., an alphanumeric string), or may be stored according to a condensed format (e.g., a hashcode of the identifier 42).
As a third example of this second aspect, the identifiers 42 may be indexed in various portions, in addition to being indexed as a whole identifier. For example, an identifier 42 may comprise several portions of an identifier for which a user 12 may search, such as different portions of a filename of a file (e.g., the file “David's_Report.doc” might be queried by the user 12 as “David”, “Report”, “doc”, “David's_Report”, “Report.doc”, or “David's_Report.doc”). Therefore, a particular identifier 42 for a particular content item 22 may be indexed in several different ways, based on these variances in the ways that a user 12 may search for the identifier 42 in a query 14. Moreover, different identifier weights 44 may be stored with the different identifiers 42 to indicate the relative relevance of a token 52 matching the respective identifier 42 and/or the distinctiveness of the identifier 42 in identifying the content item 22 as distinguished from other content items 22. For example, a content item 22 may be associated with a name having various name components (e.g., a first name, a middle name, a last name, and a suffix), and an embodiment of these techniques may be configured to index the content item 22 by both the name and various name components. Moreover, the different selectivity of different name components may be represented as different identifier weights 44; e.g., an identifier 42 representing a name of a content item 22 may be indexed with a high identifier weight, while name components may be indexed with low identifier weights.
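One way to derive the portions of an identifier such as “David's_Report.doc” is to split on separator characters and index every contiguous run of segments. In the Python sketch below, the separator set (underscore and period), the possessive-stripping rule, and the three weight values are assumptions chosen for illustration.

```python
import re


def identifier_portions(filename):
    """Map each searchable portion of a filename to a hypothetical weight."""
    # re.split with a capturing group keeps the separators at odd positions.
    parts = re.split(r"([_.])", filename)
    segments = parts[0::2]
    count = len(segments)
    portions = {}
    for start in range(count):
        for end in range(start, count):
            text = "".join(parts[2 * start : 2 * end + 1])
            if not text:
                continue
            if start == 0 and end == count - 1:
                weight = 1.0   # the whole identifier: high weight
            elif end > start:
                weight = 0.5   # a run of several segments: medium weight
            else:
                weight = 0.3   # a single segment: low weight
            portions[text] = weight
    # Assumed normalization: also index the stem of a possessive segment.
    for segment in segments:
        if segment.endswith("'s"):
            portions.setdefault(segment[:-2], 0.3)
    return portions
```

For “David's_Report.doc”, this produces the six portions listed above (plus the unstripped segment “David's”), with the whole filename carrying the highest weight and single components the lowest.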
A third aspect that may vary among embodiments of these techniques relates to simple filtering techniques that may be implemented in conjunction with the relevance-based techniques provided herein. As a first example, a user 12 may submit a query 14 specifying a particular content item type of candidate content items 38 to be presented, such as only email messages or only contact records (e.g., the query “email joe smith” may be inferred as restricting the candidate content items 38 to only email messages). As a second example of this third aspect, the user 12 may submit a query 14 including one or more tokens 52 specifying a particular content set 20, e.g., objects in a particular filesystem or in a particular portion thereof (e.g., the query “filesystem joe smith” may be inferred as restricting the candidate content items 38 only to those stored in the local filesystem). As a third example of this third aspect, a query 14 may specify that one or more tokens 52 are to be applied only to particular identifier types (e.g., the query “name joe smith” may be inferred as restricting the candidate content items 38 only to those matching the following tokens 52 in a “name” identifier type, such as the owner of a file, the sender or recipient of an email message, or the first name and/or last name of a contact record). For example, different types of content items 22 may have different sets of identifiers 42, but some identifiers 42 may have a shared semantic (e.g., “Name”, “Title”, or “Date of Creation”) and/or a shared data format (e.g., “email address”, “date”, or “telephone number”).
A token 52 of a query 14 may therefore specify that candidate content items 38 have an identifier type of a particular value (e.g., the query 14 “name joe smith” may specify content items 22 having an identifier of semantic type “Name” with a value such as “Joe Smith”; the query 14 “email joe@mail.com” may specify content items 22 having an identifier formatted as email addresses and having the value “joe@mail.com”). In this manner, various tokens 52 of the query 14 may be construed to specify various types of simple filtering that may be applied to the content items 22. Those of ordinary skill in the art may devise many ways of permitting a user 12 to apply a simple filter to a query 14 while implementing the techniques presented herein.
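As a hedged sketch of such simple filtering, a leading token of the query might be inspected and, if it names a known content item type or identifier type, removed from the tokens and recorded as a filter. The two vocabularies and the function name below are hypothetical; an actual embodiment may recognize filters in other positions or by other means.

```python
# Hypothetical filter vocabularies; an embodiment would define its own.
ITEM_TYPES = {"email", "contact", "file"}
IDENTIFIER_TYPES = {"name", "title"}


def parse_query(query):
    """Split a query into (filters, remaining tokens)."""
    tokens = query.lower().split()
    filters = {}
    # Only treat the first token as a filter if other tokens follow it.
    if len(tokens) > 1:
        if tokens[0] in ITEM_TYPES:
            filters["item_type"] = tokens.pop(0)
        elif tokens[0] in IDENTIFIER_TYPES:
            filters["identifier_type"] = tokens.pop(0)
    return filters, tokens
```

For example, `parse_query("name joe smith")` yields the filter `{"identifier_type": "name"}` together with the tokens `["joe", "smith"]`, while a query consisting only of the word “email” is left as an ordinary token.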
A fourth aspect that may vary among embodiments of these techniques relates to the manner of extracting tokens 52 from a query 14 for application to the content index 46. As a first example, the user 12 may explicitly differentiate tokens 52, e.g., by entering different tokens 52 in a sequence. Alternatively, the user 12 may delineate tokens 52 within a query 14 by various properties, e.g., by separating whitespace characters, such as a space, tab, or carriage return. Some embodiments may also permit the user 12 to specify that several sequences are to be evaluated as a single token, e.g., by enclosing a set of tokens in quotation marks or parentheses.
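One possible way of extracting tokens, offered only as a sketch, is to treat a run enclosed in double quotation marks as a single token and to split the remainder on whitespace. The function name is an assumption of this sketch, and parenthesized grouping is not handled.

```python
import re


def extract_tokens(query):
    """Return quoted phrases as single tokens and other runs split on whitespace."""
    # Each match fills exactly one of the two groups: a quoted phrase or a bare word.
    pairs = re.findall(r'"([^"]*)"|(\S+)', query)
    return [quoted or word for quoted, word in pairs if quoted or word]
```

For example, `extract_tokens('joe "annual report"\tsmith')` returns `['joe', 'annual report', 'smith']`.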
As a second example of this fourth aspect, an embodiment may apply the tokens 52 to the content index 46 in various ways. As a first such variation, the tokens 52 may be applied to the content index 46 in a particular order; e.g., a token 52 identified as highly selective of a small set of content items 22 (e.g., a long string or an unusual term) may be applied to the content index 46 before a token 52 identified as less selective among the content items 22 (e.g., a short string or a frequent term). As a second such variation, an embodiment may endeavor to detect and correct possible typographical errors (e.g., suggesting a replacement of the token 52 “patnet” with the token 52 “patent”). As a third such variation, an embodiment may apply each token 52, as well as a token 52 comprising the entire query 14. This variation may be helpful, e.g., for promoting matching with identifiers 42 that match the entire query 14 or a significant portion thereof.
As a third example of this fourth aspect, the application of tokens 52 to the content index 46 may be adjusted in various ways. In a first such variation, content items 22 may be selected as candidate content items 38 only if at least one identifier 42 of the content item 22 matches each token 52 of the query 14. This variation may be advantageous for respecting that each token 52 has some semantic value to the user 12, and that a content item 22 cannot be selected as a candidate content item 38 if any token 52 is not matched to the candidate content item 38 in some way. As another variation, highly relevant content items 22 may be included as candidate content items 38 even if one or more tokens 52 of the query 14 do not match at least one identifier 42. This variation may be advantageous, e.g., if a highly relevant content item 22 happens to fail to match one or more criteria of the query 14, or if one particular token 52 matches no content items 22 (e.g., a typographical error in a token 52 that matches no identifier 42 of any content item 22 may be disregarded). Alternatively, a proximity adjustment may be calculated and used in searching the content index 46; e.g., if a token 52 such as “patnet” matches few or no identifiers 42 of the content items 22, candidate content items 38 may be selected that include one or more identifiers 42 that are proximate to the token 52, such as those containing the term “patent”.
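The first variation and the proximity adjustment may be combined in a short sketch such as the following, which requires every token to match and falls back to the closest indexed term when a token matches nothing. The index is assumed to map each indexed term to a set of content item identifiers, and the similarity measure (the standard `difflib` ratio with an assumed cutoff of 0.8) is merely one of many proximity measures that might be chosen.

```python
import difflib


def select_candidates(index, tokens, cutoff=0.8):
    """Return the content items matched by every token (index: term -> set of items)."""
    candidates = None
    for token in tokens:
        matched = index.get(token)
        if not matched:
            # Proximity fallback: use the closest indexed term, if one is near enough.
            close = difflib.get_close_matches(token, list(index), n=1, cutoff=cutoff)
            matched = index[close[0]] if close else set()
        # Intersect so that each token must match the candidate in some way.
        candidates = set(matched) if candidates is None else candidates & matched
    return candidates or set()


index = {"patent": {1, 2}, "report": {2, 3}, "joe": {3}}
```

With this illustrative index, the misspelled query tokens “patnet” and “report” select only item 2, because “patnet” is resolved to the proximate term “patent”.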
A fifth aspect that may vary among embodiments of these techniques relates to adjustments to the rank scores 56 of candidate content items 38 in view of other criteria that may be predictive of the relevance of the matching of the candidate content item 38 to the query 14. In some embodiments of these techniques, after retrieving the identifiers 42 matching the tokens 52 of the query 14 and calculating a rank score 56 for the associated candidate content items 38 based on the identifier weights 44 stored with such identifiers 42, the rank scores 56 of the candidate content items 38 may be adjusted to improve the ordering of the candidate content items 38 in view of the predicted relevance thereof to the intent of the user 12 in formulating the query 14.
As a first example of this fifth aspect, the rank scores 56 of candidate content items 38 may be computed in view of a particular search context of the query 14. It may be appreciated that different queries 14 may be entered in different search contexts. For example, a first query 14 may be entered in a search control of an email client application; a second query 14 may be entered into a search control of a contacts database; and a third query 14 may be entered into a search control of a filesystem. It may also be appreciated that the user 12 may choose the tokens 52 of the query 14 differently in view of the search context. For example, if a user 12 enters a query 14 in the context of a name search (e.g., a search initiated in the context of a “To:” line in an email message), candidate content items 38 matching a query 14 on a name-related identifier (e.g., the Sender field of an email message or the Name field of a contact record) may be of higher predicted relevance to the user 12 than candidate content items 38 matching the query 14 on a filesystem-related identifier (e.g., a filename field). Conversely, if the user 12 enters a query 14 in a file-related search context (e.g., attaching an object to an email message), the filename field may be of higher predicted relevance. Accordingly, the search context of each query 14 may be taken into account while inferring the intent of the user 12 and interpreting the query 14. For example, if a query 14 is provided by the user 12 in a search context associated with at least one identifier 42, the rank scores 56 of various candidate content items 38 may be computed by raising the identifier weights 44 of identifiers 42 matching a token 52 of the query 14 that are also associated with the search context.
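A minimal sketch of such a context-sensitive computation is given below. The context names, the identifier types associated with each context, and the multiplier are all hypothetical values chosen for illustration; the rank score is modeled simply as the sum of the (possibly raised) identifier weights of the matched identifiers.

```python
# Hypothetical mapping from a search context to the identifier types it favors.
CONTEXT_IDENTIFIER_TYPES = {
    "email_to_line": {"name", "sender"},
    "attach_file": {"filename"},
}
CONTEXT_BOOST = 2.0  # assumed multiplier for identifiers tied to the context


def rank_score(matches, context=None):
    """matches: (identifier_type, identifier_weight) pairs for one candidate item."""
    favored = CONTEXT_IDENTIFIER_TYPES.get(context, set())
    return sum(
        weight * (CONTEXT_BOOST if id_type in favored else 1.0)
        for id_type, weight in matches
    )
```

Under these assumptions, a candidate matched on a “name” identifier and a “filename” identifier, each with weight 0.5, scores 1.5 in the “email_to_line” context and 1.0 when no context is supplied.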
As a second example of this fifth aspect, where the candidate content items 38 may be evaluated for popularity (e.g., in the context of content items 22 accessed by a user 12, the frequency with which the user 12 has accessed the content item 22 in the past; and in the context of web search results, the number of users clicking through a link to a particular content item 22, or the number of links to the content item 22 on other pages), the contribution of an identifier weight 44 of an identifier 42 may be adjusted based on the popularity of the candidate content item 38. For example, if the popularity of a content item 22 is associated with the likelihood of a user 12 searching for the content item 22, the rank score 56 of the candidate content item 38 may be increased, thereby presenting popular candidate content items 38 as having a higher predicted relevance to the user 12 than similarly weighted but unpopular candidate content items 38.
As a third example of this fifth aspect, the contribution of an identifier weight 44 of an identifier 42 to the rank score 56 of a candidate content item 38 may be increased if a token 52 matches multiple identifier portions of the identifier 42. For example, if the query 14 comprises a particular token 52, an identifier 42 having several instances of this token 52 may be regarded as having a higher predictive relevance than an identifier 42 having fewer or only one instance of this token 52. Accordingly, while calculating the rank scores 56 of respective candidate content items 38, an embodiment of these techniques may be configured to raise the identifier weights 44 of identifiers 42 matching more than one token 52 of the query 14.
As a fourth example of this fifth aspect, a query 14 may comprise multiple tokens 52 specified as a sequence that may together match various identifier portions of a particular identifier 42. It may be appreciated that the sequence whereby a user 12 enters tokens 52 in a query 14 may be significant, and that sequential conformity of the identifier portions of identifiers 42 matching the sequence of the tokens 52 may be predictive of the relevance of the associated candidate content item 38 to the intent of the query 14. Accordingly, in this fourth example, the identifier weight 44 of the identifier 42 may be raised if the tokens 52 match the identifier portions in approximately the same sequence. For example, if a second token 52 sequentially follows a first token 52 in the query 14, the identifier weight 44 of an identifier 42 may be increased if the first token 52 matches a first identifier portion of the identifier 42, and the second token 52 matches a second identifier portion of the identifier 42 that sequentially follows the first identifier portion. In a first such variation, the identifier weight 44 may also be increased in proportion to a proximity of the second identifier portion of the identifier 42 with the first identifier portion; e.g., the magnitude by which the identifier weight 44 is raised increases as the tokens 52 match identifier portions that are closer together within the identifier 42. In a second such variation, the identifier weight 44 may be particularly strongly increased if the second identifier portion directly sequentially follows the first identifier portion, e.g., if the first token 52 and the second token 52 match with a sequence of directly following identifier portions in the identifier 42, such as a phrase.
Additional increases in the rank scores 56 may be made if additional tokens 52 also match according to the sequence of identifier portions in an identifier 42, e.g., four tokens 52 matching four directly sequential identifier portions of a candidate content item 38.
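The sequence-based adjustment may be illustrated with the following sketch, which returns a multiplier to be applied to an identifier weight. The constants and the function name are assumptions; each token is located at the first identifier portion it equals, a directly following pair earns the larger bonus, and a pair in the right order but farther apart earns a bonus that shrinks with the gap.

```python
def sequence_bonus(tokens, portions, base=1.0, adjacent=2.0):
    """Return a multiplier for an identifier weight (constants are assumed)."""
    positions = []
    for token in tokens:
        if token not in portions:
            return 1.0  # not every token matches this identifier: no bonus
        positions.append(portions.index(token))  # first matching portion
    bonus = 1.0
    for first, second in zip(positions, positions[1:]):
        gap = second - first
        if gap == 1:
            bonus += adjacent      # directly sequential portions, such as a phrase
        elif gap > 1:
            bonus += base / gap    # same order; closer portions earn more
    return bonus
```

For example, the tokens “joe” and “smith” yield a multiplier of 3.0 against the portions `["joe", "smith"]`, 1.5 against `["joe", "q", "smith"]`, and 1.0 against `["smith", "joe"]`, where the order is reversed. Because each adjacent pair of tokens contributes separately, additional tokens that continue the sequence raise the multiplier further.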
As a fifth example of this fifth aspect, the rank score 56 of a candidate content item 38 may be strongly increased if the identifier 42 fully matches the query 14. For example, a query 14 comprising the tokens 52 “joe smith” may result in the calculation of a strongly increased rank score 56 for a contact record having the name “Joe Smith”. This adjustment may satisfy the intent of a user 12 who happens to enter the full and exact contents of an identifier 42 associated with a candidate content item 38.
As a sixth example of this fifth aspect, the rank score 56 of a candidate content item 38 may be increased based on a percentage of an identifier portion of an identifier 42 matching the token 52. For example, for a query 14 comprising a token 52 having three characters (e.g., “Kat”), the identifier weight 44 of a first identifier 42 matching the three characters of the token 52 and having an overall length of four characters (e.g., “Kate”), where 75% of the identifier 42 matches the token 52, may be factored into the rank score 56 of the corresponding candidate content item 38 with a higher adjustment than a second identifier 42 matching the three characters of the token 52 but having an overall length of nine characters (e.g., “Katherine”), where only 33% of the identifier 42 matches the token 52.
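The percentages in this example may be reproduced with a brief sketch. It assumes, for simplicity, that a token matches an identifier only as a case-insensitive prefix; the function name is hypothetical, and other forms of matching would call for a different computation.

```python
def match_coverage(token, identifier):
    """Fraction of the identifier covered by a prefix match of the token."""
    if not identifier or not identifier.lower().startswith(token.lower()):
        return 0.0
    return len(token) / len(identifier)
```

Here `match_coverage("Kat", "Kate")` evaluates to 0.75, while `match_coverage("Kat", "Katherine")` evaluates to one third, so the former identifier would receive the higher adjustment.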
As a seventh example of this fifth aspect, the rank score 56 of a candidate content item 38 may be increased based on the distinctiveness of the matched identifier 42 with the candidate content item 38 among the content items 22 of the content sets 20; e.g., a comparatively infrequent token 52 that matches a candidate content item 38 may have an adjusted higher identifier weight 44 than a comparatively frequent token 52 that matches the candidate content item 38 but also many other content items 22. Accordingly, the identifier weight 44 of an identifier 42 may be raised inversely to the content item count of content items 22 matching the token 52. For example, for a query 14 comprising the tokens 52 “joe” and “arrington”, the token 52 “joe” may match many content items 22, but the token 52 “arrington” may match only a few content items 22, and may therefore be comparatively highly selective of candidate content items 38. Accordingly, an embodiment of these techniques may raise the rank score 56 of candidate content items 38 matching the token 52 “arrington” to reflect the selectivity of this matching, as compared with the comparatively less selective matching with the token 52 “joe”. Those of ordinary skill in the art may devise many ways of adjusting the rank scores 56 of candidate content items 38 to improve the predicted relevance of the search results 36 to the intent of the user 12 in formulating the query 14 in accordance with the techniques presented herein.
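One simple way of raising an identifier weight inversely to the content item count, shown only as a sketch, is to divide a base weight by the number of content items matching the token. The function name and the sample index below are assumptions, and smoother functions (such as a logarithmic inverse frequency) may serve equally well.

```python
def selectivity_weight(index, token, base_weight=1.0):
    """Scale a weight inversely to the number of items matching the token."""
    count = len(index.get(token, ()))
    return base_weight / count if count else 0.0


# Illustrative index: "joe" matches four items, "arrington" only one.
index = {"joe": {1, 2, 3, 4}, "arrington": {4}}
```

With this index, the token “arrington” contributes a weight of 1.0 and the token “joe” a weight of 0.25, reflecting the greater selectivity of the rarer token.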
A sixth aspect that may vary among embodiments of these techniques relates to the presentation to the user 12 of the candidate content items 38 as a set of search results 36 in response to the query 14. As a first example of this sixth aspect, the candidate content items 38 may be simply identified (e.g., as a list of files), may be linked (e.g., as a set of hyperlinks or icon-based shortcuts) for easy access, may be presented as previews (e.g., a set of thumbnails or text excerpts of documents), and/or may be presented to the user 12 (e.g., as a slideshow of images matching the query 14). As a second example of this sixth aspect, the candidate content items 38 may be presented sorted according to the rank scores 56, but may also be sorted according to other criteria. In one such variation, where candidate content items 38 have a name, the candidate content items 38 may first be sorted by a name length of the names, and may then be stably sorted according to the rank scores 56. As a third example of this sixth aspect, the candidate content items 38 may be presented along with the identifiers 42 matching the tokens 52 of the query 14. This example may be advantageous, e.g., for presenting to the user 12 some of the rationale for presenting respective content items 22 in the search results 36, particularly for content items 22 where such rationale may not be readily apparent from the other presented information (e.g., it may be unclear why a candidate content item 38 named “Report.doc” is included in the search results 36 for a query 14 comprising the tokens 52 “joe smith”, so the identifiers 42 matching the tokens 52 of the query 14, such as an Author metadata field specifying the name “Joe Smith” or a phrase containing this name embedded in the document, may be presented along with the candidate content item 38).
Additionally, the identifier portions of the identifiers 42 that matched the respective tokens 52 of the query 14 may be emphasized in the presentation of candidate content items 38, e.g., by presenting the matched identifier portions in bolded typeface.
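The two-pass ordering and the emphasis of matched portions may be sketched as follows. Candidates are modeled as (name, rank score) pairs, and the use of `<b>` markup to denote a bolded typeface is an assumption of this sketch; both function names are hypothetical.

```python
import re


def order_results(candidates):
    """candidates: (name, rank_score) pairs; shorter names win among equal scores."""
    by_length = sorted(candidates, key=lambda c: len(c[0]))
    # Python's sort is stable, so equal rank scores keep the name-length order.
    return sorted(by_length, key=lambda c: c[1], reverse=True)


def emphasize(identifier, tokens):
    """Wrap each portion of the identifier matching a token in bold markup."""
    if not tokens:
        return identifier
    pattern = "|".join(re.escape(token) for token in tokens)
    return re.sub(pattern, lambda m: f"<b>{m.group(0)}</b>", identifier,
                  flags=re.IGNORECASE)
```

For example, `emphasize("Joe Smith", ["joe", "smith"])` returns `<b>Joe</b> <b>Smith</b>`, and two candidates with equal rank scores named “Kate” and “Katherine” are presented with “Kate” first.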
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
As used in this application, the terms “component,” “module,” “system”, “interface”, and the like are generally intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
Furthermore, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. The term “article of manufacture” as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.
Although not required, embodiments are described in the general context of “computer readable instructions” being executed by one or more computing devices. Computer readable instructions may be distributed via computer readable media (discussed below). Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types. Typically, the functionality of the computer readable instructions may be combined or distributed as desired in various environments.
In other embodiments, device 162 may include additional features and/or functionality. For example, device 162 may also include additional storage (e.g., removable and/or non-removable) including, but not limited to, magnetic storage, optical storage, and the like. Such additional storage is illustrated in
The term “computer readable media” as used herein includes computer storage media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions or other data. Memory 168 and storage 170 are examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by device 162. Any such computer storage media may be part of device 162.
Device 162 may also include communication connection(s) 176 that allows device 162 to communicate with other devices. Communication connection(s) 176 may include, but is not limited to, a modem, a Network Interface Card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interfaces for connecting computing device 162 to other computing devices. Communication connection(s) 176 may include a wired connection or a wireless connection. Communication connection(s) 176 may transmit and/or receive communication media.
The term “computer readable media” may include communication media. Communication media typically embodies computer readable instructions or other data in a “modulated data signal” such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” may include a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
Device 162 may include input device(s) 174 such as keyboard, mouse, pen, voice input device, touch input device, infrared cameras, video input devices, and/or any other input device. Output device(s) 172 such as one or more displays, speakers, printers, and/or any other output device may also be included in device 162. Input device(s) 174 and output device(s) 172 may be connected to device 162 via a wired connection, wireless connection, or any combination thereof. In one embodiment, an input device or an output device from another computing device may be used as input device(s) 174 or output device(s) 172 for computing device 162.
Components of computing device 162 may be connected by various interconnects, such as a bus. Such interconnects may include a Peripheral Component Interconnect (PCI), such as PCI Express, a Universal Serial Bus (USB), firewire (IEEE 1394), an optical bus structure, and the like. In another embodiment, components of computing device 162 may be interconnected by a network. For example, memory 168 may be comprised of multiple physical memory units located in different physical locations interconnected by a network.
Those skilled in the art will realize that storage devices utilized to store computer readable instructions may be distributed across a network. For example, a computing device 180 accessible via network 178 may store computer readable instructions to implement one or more embodiments provided herein. Computing device 162 may access computing device 180 and download a part or all of the computer readable instructions for execution. Alternatively, computing device 162 may download pieces of the computer readable instructions, as needed, or some instructions may be executed at computing device 162 and some at computing device 180.
Various operations of embodiments are provided herein. In one embodiment, one or more of the operations described may constitute computer readable instructions stored on one or more computer readable media, which if executed by a computing device, will cause the computing device to perform the operations described. The order in which some or all of the operations are described should not be construed as to imply that these operations are necessarily order dependent. Alternative ordering will be appreciated by one skilled in the art having the benefit of this description. Further, it will be understood that not all operations are necessarily present in each embodiment provided herein.
Moreover, the word “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word exemplary is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then “X employs A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims may generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The disclosure includes all such modifications and alterations and is limited only by the scope of the following claims. In particular regard to the various functions performed by the above described components (e.g., elements, resources, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms “includes”, “having”, “has”, “with”, or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”