Within the field of computing, many scenarios involve an application of a query to a data set comprising a set of data entries, such that the data entries matching the selectivity criteria of the query are identified and returned as a set of query results. The query often comprises a set of keywords, which may be structured in many ways (e.g., as a natural-language query, a Boolean query having several criteria organized in a logical framework, or a specific phrase with which matching query entries are associated.) The query may also be generated by and received from many types of sources, including a user who may enter the query as text into a textbox control of a website or application and an automated process that may request, receive, and utilize data entries matching certain criteria.
In some scenarios, the data set may comprise a set of structured data, such as a database comprising a set of records, an extensible markup language (XML) document specifying a set of entities in a well-structured declarative format, and an object library comprising a set of objects having particular properties. In regard to such structured data sets, a query may specify criteria to be applied against one or more attributes of the data set (e.g., one or more attributes of a database table, one or more attributes of the entities of an XML document, or one or more member fields or properties of an object.) For example, in a data set representing people, a query may specify criteria such as “people having the first name of ‘David’, a last name beginning with the letter ‘S’, and an age between 15 and 45 years.” The various attributes specified in this query may be applied against corresponding attributes of the data set (e.g., the first name, last name, and age fields, respectively) in order to identify people who match the specified criteria.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key factors or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Several difficulties may arise when applying a query against a well-structured data set having various attributes. As a first example, the query may not specify an attribute against which a particular query term is to be applied; e.g., a data set representing people may be targeted by a query specifying the query term “Louis,” but it may not be clear whether this query term refers to a first name, a last name, or a resident of the city of St. Louis in the state of Missouri in the United States. As a second example, the query may be intended to seek data entries of a particular type, but may include terms that do not precisely describe the particular type; e.g., a data set comprising data entries that represent a set of computers may be targeted by a query specifying “portable” computers, but this term may be validly interpreted in many ways (e.g., workstations that may be easily transported, such as featuring a case with a handle; workstations having integrated components, such as an all-in-one computer built into a display; computers having a comparatively mobile architecture, such as a notebook, netbook, tablet, or palmtop; computers having components that facilitate travel, such as an integrated battery and a wireless or cellular network adapter; notebook computers having comparatively small dimensions and that may fit into small compartments; or lightweight computers that are easily hefted.) Because of the unstructured and possibly ambiguous nature of such queries, it may be difficult to provide query results that meet the intent of the query.
Techniques may be utilized to identify intended meanings of the terms of a query. In particular, techniques may be identified to determine, for a particular query term such as a keyword, the data entries that the query term differentially selects (and excludes) in contrast with queries that do not include the query term. For example, from a historic set of queries received and applied to the data set, a set of query pairs may be identified, where each query pair comprises a “background query” comprising a set of background query terms, and a “foreground query” comprising the set of background query terms along with a foreground keyword. The data entries of the data set that are more often selected when the foreground keyword is included may be identified as potentially relevant to the foreground keyword. Among many such sets of data entries for many query pairs, a shared property in a particular attribute of the differentially selected query results may be identified, and a query predicate may be identified that targets the shared property in the attribute. This query predicate may be associated with the keyword in a keyword map, along with a confidence score (e.g., an estimate of the confidence that the query predicate selects data entries consistently with the intent of the query designer.) In this manner, the prevalent selectivity of a particular keyword over the data entries of the data set may be identified.
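The pairing of foreground and background queries described above may be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name find_query_pairs is hypothetical, the query log is assumed to be a list of plain keyword strings, and queries are compared as unordered sets of lowercase terms, which is one of several possible normalizations.

```python
def find_query_pairs(query_log, keyword):
    """Return (foreground_terms, background_terms) pairs for `keyword`.

    A foreground query consists of a background query's terms plus `keyword`.
    Assumption: term order and letter case are ignored when matching queries.
    """
    keyword = keyword.lower()
    seen = {frozenset(q.lower().split()) for q in query_log}
    pairs = []
    for terms in seen:
        if keyword in terms:
            background = terms - {keyword}
            # Keep the pair only if the background query was also issued.
            if background and background in seen:
                pairs.append((terms, background))
    # Sort for a deterministic order (sets have no inherent order).
    return sorted(pairs, key=lambda pair: sorted(pair[0]))


log = ["small netbook", "netbook", "small workstation",
       "workstation", "small server", "notebook"]
pairs = find_query_pairs(log, "small")
```

In this example, “small server” yields no pair because the log contains no corresponding “server” query; only the netbook and workstation pairs are returned.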
The keyword map prepared in this manner may be utilized in the application of search queries to the data set in order to identify query results that have higher relevance to the intent of the search query. For example, when a query is received, the keywords of the query may be translated into the query predicates respectively associated with the keywords according to the keyword map. The translated query may be applied to the data set (with particular query predicates selectively restricting corresponding attributes of the data set), thereby improving the relevance of the query results to the query designer based on inferences about the predicted meanings of the keywords of the query. As another technique, the query may be interpreted as a set of tokens, where the tokens may be partitioned in different ways to achieve different sets of keywords (e.g., “small business notebook” may be partitioned into the keywords “small” and “business notebook,” or into the keywords “small business” and “notebook”.) In order to choose among the different keyword sets that may be partitioned from the query, the confidence scores of the various keywords of each keyword set may be aggregated, and the keyword set having a high confidence score, which may represent a high correlation between the selected keyword set and the intended meaning of the query, may be selected.
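The selection among alternative partitions of a query may be illustrated with a short sketch. The names partitions and best_partition, the example predicates, and the example confidence scores are hypothetical. The sketch assumes that the keyword map associates each keyword with a (query predicate, confidence score) tuple, that unknown keywords score zero, and that the aggregation is an arithmetic mean; other aggregations may equally be used.

```python
def partitions(tokens):
    """Yield every split of `tokens` into contiguous multi-word keywords."""
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        head = " ".join(tokens[:i])
        for rest in partitions(tokens[i:]):
            yield [head] + rest


def best_partition(query, keyword_map):
    """Return the keyword set with the highest mean confidence score.

    Assumptions: `keyword_map` maps keyword -> (predicate, confidence);
    keywords absent from the map contribute a confidence of 0.0.
    """
    tokens = query.lower().split()

    def score(keywords):
        total = sum(keyword_map.get(k, (None, 0.0))[1] for k in keywords)
        return total / len(keywords)

    return max(partitions(tokens), key=score)


# Hypothetical keyword map for a data set of computers.
keyword_map = {
    "small": ("size < 13", 0.6),
    "notebook": ("type = 'notebook'", 0.8),
    "business notebook": ("category = 'business'", 0.9),
    "small business": ("category = 'smb'", 0.2),
}
chosen = best_partition("small business notebook", keyword_map)
```

With these example scores, the partition into “small” and “business notebook” has the highest mean confidence (0.75) and is selected.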
To the accomplishment of the foregoing and related ends, the following description and annexed drawings set forth certain illustrative aspects and implementations. These are indicative of but a few of the various ways in which one or more aspects may be employed. Other aspects, advantages, and novel features of the disclosure will become apparent from the following detailed description when considered in conjunction with the annexed drawings.
The claimed subject matter is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the claimed subject matter. It may be evident, however, that the claimed subject matter may be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to facilitate describing the claimed subject matter.
Within the field of computing, many scenarios involve the application of a search query to a data set comprising various data entries having a particular structure. As a first example, a relational database comprises one or more related tables, where each table comprises a particular set of fields that confer structure upon records stored in the table, and an SQL query may be applied to the relational database to select records or combinations thereof based on criteria to be applied to the fields of specified tables. As a second example, an object database comprises a set of objects having various fields, and an object query may be applied to the object database to identify objects having fields that match various criteria of the object query.
In many such scenarios, the query may be specified as a set of keywords, which may be matched to the values of various attributes for various data entries of the data set. For example, a natural language search engine may interface with a data set comprising a set of data entries having natural language fields (e.g., a database of news articles comprising a title, a location, a date, an abstract, an author name, and the body of the news article), may accept a natural language query crafted by a user as a set of keywords, and may apply the keywords of the natural language query to the fields of the news article database to identify matching news articles that may be returned as search results. In such scenarios, it may be difficult to identify how the keywords of the search query are to be applied to the various attributes of the data set; e.g., a search query specifying the keyword “Louis” may apply to an article on the topic of a hurricane named Louis, or to an article written by a reporter named Louis, or to articles relating news arising in the location of the city of St. Louis, Mo. Therefore, interpreting the meaning of the query that may have been intended by the user may significantly impact the relevance of the search results to the user, and techniques for improving the identification of such intent may yield search results with improved relevance and value to the user.
While many ways of applying the keywords 20 of the query 18 to the data set 12 may be utilized, it may be appreciated that more sophisticated techniques may be capable of selecting search results that are of greater value to the user who submitted the query. In particular, some techniques may be able to identify the semantics of the query 18 with improved accuracy, such as the intended meanings of the various keywords 20 in relation to the data set 12, and may be able to identify search entries 16 that are more directly relevant to the semantic intent of the query. These techniques may be particularly helpful for satisfying natural language queries, where keywords may have different intended meanings in different contexts. For example, in the exemplary scenario 10 of
The third query 18 in the exemplary scenario 10 of
In these and other scenarios, it may be difficult to apply the query 18 to the data set 12 in a manner that produces a result set 22 of high relevance to the author of the query 18 because it may be unclear how to translate the keywords 20 of the query 18 into the selectivity criteria of the query 18. For example, it may be difficult to select one or more attributes 14 of the data set 12 that are targeted by the keyword 20, or to determine how to evaluate the values of such attributes 14 of various data entries 16 for the keyword 20 (e.g., the qualifying dimensions of a “small” computer.) Additionally, it may be difficult to interpret semantic relationships among keywords 20 of the query 18, e.g., how to interpret the keyword “small” in view of the additional keywords “HiTech” and “laptop.” While it may be possible to identify the semantic intent of such queries 18 in a non-automated way (e.g., by having other users identify the likely semantic intent of various queries 18, such as in a “mechanical Turk” solution, or by having users define query predicates for various search terms), such techniques may be inaccurate, cumbersome, or inefficient.
Alternative techniques for evaluating queries 18 may be devised that may be capable of producing query results 24 of a comparatively high relevance to the author of the query by identifying with improved confidence the intent of respective keywords 20 of the query 18, both in isolation and in the context of the other keywords 20 of the query 18. It may be appreciated that many queries 18 may have been issued against a data set 12, and may be recorded, e.g., in a query set, such as a historic log of queries 18 that have been formulated and applied to the data set 12. An evaluation of these queries 18, and the result sets 22 generated thereby, may reflect some semantic details about the interpretations of keywords 20 that are often included in such queries 18, both in isolation and in the context of other keywords 20 utilized in the same query 18. For example, a query 18 containing the keywords “small computer” may yield a comparatively arbitrary result set 22 if the semantic intent of the keyword 20 “small” cannot be easily determined. However, the result sets 22 of other queries 18 featuring the keyword 20 “small,” such as queries 18 for “small netbook,” “small workstation,” and “small notebook” may yield result sets 22 that confer a fairly specific and consistent meaning upon the keyword “small”—especially if such result sets 22 are compared with the result sets 22 of corresponding queries 18 that omit the keyword, such as queries 18 for “netbook,” “workstation,” and “notebook.” That is, by comparing the result sets 22 of corresponding pairs of queries 18, such as “small netbook” and “netbook,” “small workstation” and “workstation,” and “small notebook” and “notebook,” an automated process may identify a consistent semantic meaning attributed to each instance of the keyword 20 “small” as indicating computers with comparatively low numbers in the “size” attribute. 
This identification may be utilized both generally, e.g., to determine what the keyword 20 “small” may connote in other queries (such as “small computer”), and also specifically, e.g., to determine what the keyword 20 “small” may connote in the specific queries 18 so formulated (such as the dimensions that constitute a “small” notebook, vs. the dimensions that constitute a “small” workstation.) These identified semantics of the keyword 20 “small” may therefore be applied in the evaluation of other queries 18. For example, if the keyword 20 “small” is later used in a new context, such as “small server,” the prior evaluations of the keyword 20 “small” in other contexts may suggest a comparison of the dimensions of various computers qualifying as servers and the subset of such computers that have low values in the “size” attribute 14. In this manner, the process of interpreting the intended semantics of various keywords 20 that may be encountered in various queries 18 may be automated, and the resulting determinations may be used to apply such keywords 20 to the attributes 14 of the data set 12 in a manner that produces result sets 22 that are highly relevant to the intent of such queries 18.
If many query pairs 34 are evaluated for a keyword 32 of interest, it may be possible to identify a particular semantic interpretation of the keyword 32 as a query predicate 44 that applies the inferred selectivity criteria to the data set 12, as well as an indication of the consistency of this inference.
While the exemplary scenario 50 of
Still another embodiment involves a computer-readable medium comprising processor-executable instructions configured to apply the techniques presented herein. An exemplary computer-readable medium that may be devised in these ways is illustrated in
The techniques presented herein may be devised with variations in many aspects, and some variations may present additional advantages and/or reduce disadvantages with respect to other variations of these and other techniques. Moreover, some variations may be implemented in combination, and some combinations may feature additional advantages and/or reduced disadvantages through synergistic cooperation. The variations may be incorporated in various embodiments (e.g., the exemplary method 70 of
A first aspect that may vary among embodiments of these techniques relates to the scenarios where such techniques may be utilized. As a first example, queries 18 translated and applied as disclosed herein may be applied to many types of data sets 12, such as relational databases, object libraries or collections, declarative documents formatted in various ways (such as according to an Extensible Markup Language (XML) schema), flat files, and sets of resources. As a second example of this first aspect, the data stored within such data sets 12 may represent many concepts, such as sets of real-world or virtual resources or structured bodies of information. As a third example of this first aspect, the queries 18 applied to such data sets 12 may be specified in many ways, including natural language queries, Boolean queries, or field-specific queries that are to be applied to particular attributes 14 of the data sets 12. Similarly, the query predicates 44 may be specified and used in many ways, such as query fragments specified in the Structured Query Language (SQL) or the XPath query language, or as references to particular attributes 14 of the data set 12 and different constraints to be applied thereto. As a fourth example of this first aspect, the query pairs 34 may be manually generated, or may be mined from many types of query sets 42 storing queries 18 including query pairs 34 regarding a particular keyword 32 of interest, including a historic log of queries previously submitted by users, a fabricated query set created by an administrator of the data set 12 to populate the keyword map 48, and an automatically generated set of queries 18 that might be submitted by users of the data set 12. Those of ordinary skill in the art may select many scenarios wherein the techniques presented herein may be utilized.
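Where the query predicates 44 are stored as SQL fragments, the translation of a keyword set into a query may resemble the following sketch. The function name translate, the attribute names, and the fallback to a substring match on a “description” attribute for unmapped keywords are assumptions made for illustration only; the keyword map is assumed to contain trusted fragments prepared in advance, while user-supplied text is passed only as bound parameters.

```python
def translate(keywords, keyword_map, default_attribute="description"):
    """Build a parameterized SQL WHERE clause from a list of keywords.

    Assumptions: `keyword_map` maps keyword -> (sql_fragment, confidence),
    and keywords absent from the map fall back to a substring match on
    `default_attribute` using a `?` placeholder.
    """
    clauses, params = [], []
    for keyword in keywords:
        if keyword in keyword_map:
            fragment, _confidence = keyword_map[keyword]
            clauses.append(f"({fragment})")
        else:
            clauses.append(f"({default_attribute} LIKE ?)")
            params.append(f"%{keyword}%")
    return " AND ".join(clauses), params


where, params = translate(["small", "rugged"], {"small": ("size < 13", 0.7)})
# where  == "(size < 13) AND (description LIKE ?)"
# params == ["%rugged%"]
```

The resulting clause and parameter list may then be appended to a SELECT statement and executed through any database interface that supports positional placeholders.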
A second aspect that may vary among embodiments of these techniques relates to the manner of identifying one or more selectivity criteria while comparing query results 24 of the result sets 22 of the queries 18 in a query pair 34 for a keyword 32 of interest. Because this identification leads to the inference of semantics (both in isolation and in context) of respective keywords 20, the manner of performing this identification may significantly affect the accuracy of the inference and the resulting relevance of the query results 24. In general, it may be advantageous to utilize statistical techniques for identifying consistent factors that differentiate the query results 24 of a foreground query 36 and a background query 38 of a query pair 34. In particular, artificial intelligence techniques may be trained and utilized to identify differences, such as an artificial neural network or a genetic algorithm. Alternatively, some statistical techniques may be adept at identifying such differences, as well as calculating the confidence scores 46 of the identified selectivity criteria.
As a first example of this second aspect, the comparisons may be performed in many ways. In a first such variation, the comparison may identify one or more attributes 14 of the query results 24 of the foreground query 36 that happen to include the keyword 32 of interest, and these attributes 14 may be compared with the corresponding values of the attributes in the query results 24 of the result set 22 of the background query 38. In a second such variation, the query results 24 of the result set 22 of the foreground query 36 may be compared to identify consistent traits or patterns; the query results 24 of the result set 22 of the background query 38 may be compared to identify consistent traits or patterns; and the identified consistent traits or patterns of each result set 22 may be compared to identify differences between the queries 18 of the query pair 34. In a third such variation, the values of all attributes 14 of each query result 24 of the result sets 22 may be compared, either in isolation or in combination, to identify patterns that may exhibit differences between the query results 24 of the result set 22 of the foreground query 36 and the query results 24 of the result set 22 of the background query 38. Those of ordinary skill in the art may devise other ways of comparing the result sets 22 of the foreground query 36 and the background query 38 of the query pair 34 while implementing the techniques presented herein.
A second example of this second aspect relates to the identification of selectivity criteria relating to categorical keywords, which may specify various options within a categorical attribute. A categorical attribute of a data set 12 comprises an attribute 14 for which valid values are constrained to a small set of categories, each represented by a keyword 20. For example, in the exemplary scenarios illustrated in
In evaluating such categorical attributes, it may be advantageous to identify the selectivity criteria distinguishing the query results 24 of a foreground query 36 and a background query 38 of various query pairs 34 using an entropy or divergence calculation that identifies the magnitude of the differential probability distribution of the result sets 22. For example, where at least two keywords 20 comprise categorical keywords representing categorical values of a categorical attribute of the data set 12, the confidence scores 46 for respective categorical keywords may be computed according to a divergence computed between attribute values of results generated by the foreground queries 36 and the background queries 38 of the query pairs 34 identified in a query set 42 for the categorical keyword. One such computation that may be utilized in this role is the Kullback-Leibler divergence. This computation may be implemented for the techniques presented herein according to the following mathematical formula:
In this mathematical formula:
A represents the categorical attribute;
v represents a categorical value;
e represents a data entry included in the data set;
Se represents the data set comprising the data entries e;
Sf represents the data entries e selected from the data set Se as query results of the foreground query of the query pair;
Sb represents the data entries e selected from the data set Se as query results of the background query of the query pair; and
p(v, A, S) represents a probability distribution of the categorical value v appearing within the categorical attribute A in the data set S, computed according to a mathematical formula comprising:
This mathematical formula may be utilized to compute the magnitude and statistical significance of the divergence between the query results 24 of the foreground query 36 and the query results 24 of the background query 38 of a query pair 34. A greater divergence may indicate a higher correlation of the categorical values of the categorical attribute with the keyword 32 of interest, and may promote the selection of one or more selectivity criteria that encapsulate the semantic intent of the keyword 32 in various queries 18.
Several variations in the mathematical formula may be devised (e.g., portions of the calculation may be implemented in different ways to promote faster or more efficient computation of the mathematical formula on various devices.) As one such variation, it may be appreciated that errors may arise if the background query 38 includes zero query results 24, which may result in an attempted division by zero. Therefore, the confidence scores 46 of the categorical keyword may be computed according to this mathematical formula of divergence only for query pairs 34 where the background query 38 comprises at least one query result 24.
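The divergence computation for a categorical attribute may be sketched as follows, using the standard definition of the Kullback-Leibler divergence, in which the probability of each value in the foreground results is multiplied by the logarithm of its ratio to the probability of that value in the background results, and the products are summed over all values. The function names, the representation of data entries as dictionaries, and the small epsilon used for values absent from the background results are assumptions of this sketch. The sketch also skips pairs with an empty foreground result set, which is an additional assumption beyond the background guard described above.

```python
import math
from collections import Counter


def distribution(entries, attribute):
    """p(v, A, S): fraction of entries in S whose attribute A has value v."""
    counts = Counter(entry[attribute] for entry in entries)
    total = len(entries)
    return {value: count / total for value, count in counts.items()}


def kl_divergence(foreground, background, attribute, epsilon=1e-9):
    """KL divergence of foreground from background result distributions.

    Returns None when either result set is empty, so that such query pairs
    can be skipped rather than causing a division by zero. `epsilon` is an
    assumed smoothing floor for values missing from the background results.
    """
    if not foreground or not background:
        return None
    p_f = distribution(foreground, attribute)
    p_b = distribution(background, attribute)
    return sum(p * math.log(p / max(p_b.get(value, 0.0), epsilon))
               for value, p in p_f.items())


# Hypothetical result sets for the pair ("hitech laptop", "laptop").
background = [{"brand": "HiTech"}, {"brand": "Acme"},
              {"brand": "HiTech"}, {"brand": "Acme"}]
foreground = [{"brand": "HiTech"}, {"brand": "HiTech"}]
score = kl_divergence(foreground, background, "brand")
```

In this example the foreground results are concentrated on a single brand that accounts for half of the background results, so the divergence equals the natural logarithm of 2; identical distributions yield zero.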
A third example of this second aspect relates to the identification of selectivity criteria relating to numeric keywords, which may specify various numeric values within a numeric attribute. A numeric attribute of a data set 12 comprises an attribute 14 for which valid values represent numbers, such as physical measurements, performance or capacity metrics, prices, or dates. For example, in the exemplary scenarios illustrated in
In evaluating such numeric attributes, it may be advantageous to identify the selectivity criteria distinguishing the query results 24 of a foreground query 36 and a background query 38 of various query pairs 34 using a calculation that identifies the magnitude of the differential probability distribution of the numbers in the respective result sets 22. For example, where at least two keywords 20 comprise numeric keywords representing numeric values of a numeric attribute of the data set 12, the confidence scores 46 for respective numeric keywords may be computed according to an earth mover's distance computed between attribute values of results generated by the foreground queries 36 and the background queries 38 of the query pairs 34 identified in the query set 42 for the numeric keyword. This computation may be implemented for the techniques presented herein according to the following mathematical formula:
In this mathematical formula:
A represents the numeric attribute;
e represents a data entry included in the data set;
Se represents the data set comprising the data entries e;
Sf represents the data entries e selected from the data set Se as query results of the foreground query of the query pair;
Sb represents the data entries e selected from the data set Se as query results of the background query of the query pair;
vi represents a numeric value within numeric attribute A;
d(vi, vj) represents a measure of dissimilarity between the query results selected from the data set having a numeric value vi for the numeric attribute A and the query results selected from the data set having a numeric value vj for the numeric attribute A;
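fij represents a flow, computed in optimizing the earth mover's distance, between the data entries e selected from the data set Se as query results of the foreground query of the query pair and the data entries e selected from the data set Se as query results of the background query of the query pair, computed such that: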
wherein:
and
fij* represents an optimal flow computed for the foreground queries Sf and the background queries Sb for the numeric values of the numeric attribute A.
This mathematical formula may be utilized to compute the magnitude and statistical significance of the divergence between the query results 24 of the foreground query 36 and the query results 24 of the background query 38 of a query pair 34. A greater divergence may indicate a higher correlation of the numeric values of the numeric attribute with the keyword 32 of interest, and may promote the selection of one or more selectivity criteria that encapsulate the semantic intent of the keyword 32 in various queries 18.
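For a single numeric attribute, the earth mover's distance may be illustrated with the following sketch. It does not solve the flow optimization directly; instead it relies on the known equivalence, for one-dimensional distributions of equal total mass with the dissimilarity d(vi, vj) taken as the absolute difference of the two values, between the earth mover's distance and the area between the two cumulative distributions. The function name, the choice of dissimilarity measure, and the example values are assumptions of this sketch.

```python
from collections import Counter


def earth_movers_distance(foreground_values, background_values):
    """1-D earth mover's distance between two empirical distributions.

    Assumes d(vi, vj) = |vi - vj| and normalizes each result set to unit
    mass. Returns None if either result set is empty.
    """
    if not foreground_values or not background_values:
        return None
    counts_f = Counter(foreground_values)
    counts_b = Counter(background_values)
    n_f, n_b = len(foreground_values), len(background_values)
    support = sorted(set(counts_f) | set(counts_b))
    cdf_f = cdf_b = 0.0
    distance = 0.0
    # Accumulate |CDF_f - CDF_b| over each gap between adjacent values.
    for low, high in zip(support, support[1:]):
        cdf_f += counts_f[low] / n_f
        cdf_b += counts_b[low] / n_b
        distance += abs(cdf_f - cdf_b) * (high - low)
    return distance


# Hypothetical "size" values for the pair ("small notebook", "notebook").
emd = earth_movers_distance([10, 11], [10, 11, 15, 17])
```

In this example, half of the foreground mass has to be moved from the values 10 and 11 to the values 15 and 17 in order to match the background distribution, giving a distance of 2.75.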
A fourth example of this second aspect relates to the identification of selectivity criteria relating to textual keywords, which may specify various text strings within a textual attribute. A textual attribute of a data set 12 comprises an attribute 14 storing a set of strings, and each keyword 20 may specify a full string or a substring that is stored in the textual attribute for one or more data entries 16. For example, in the exemplary scenarios illustrated in
In evaluating such textual attributes, it may be advantageous to identify the selectivity criteria distinguishing the query results 24 of a foreground query 36 and a background query 38 of various query pairs 34 using a calculation that identifies the magnitude of the differential probability distribution of the text strings in the respective result sets 22. For example, where at least two keywords 20 comprise textual keywords representing textual values of a textual attribute of the data set 12, the confidence scores 46 for respective textual keywords may be computed according to the ratio of the frequency with which the textual keyword appears in the textual attribute for the query results 24 of the foreground query 36 to the frequency with which the textual keyword appears in the textual attribute for the query results 24 of the background query 38. This calculation may count the total number of appearances of the textual keyword in the values of the textual attribute, or may count the number of textual attributes featuring at least one appearance of the textual keyword. The calculation may also scale the counting of the textual keyword by various factors (e.g., attributing a higher significance to the presence of the keyword earlier in the “Description” value of the textual attribute than to later appearances of the keyword in the same textual attribute.)
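The frequency ratio for a textual keyword may be sketched as follows. This sketch adopts the second counting option described above (the number of attribute values featuring at least one appearance of the keyword) and uses a case-insensitive substring match; both choices, along with the function names and the example data, are assumptions made for illustration.

```python
def keyword_frequency(results, attribute, keyword):
    """Fraction of results whose text attribute contains `keyword`."""
    if not results:
        return 0.0
    needle = keyword.lower()
    hits = sum(1 for entry in results if needle in entry[attribute].lower())
    return hits / len(results)


def textual_confidence(foreground, background, attribute, keyword):
    """Ratio of foreground frequency to background frequency.

    Returns None when the keyword never appears in the background results,
    since the ratio is undefined in that case.
    """
    background_frequency = keyword_frequency(background, attribute, keyword)
    if background_frequency == 0.0:
        return None
    foreground_frequency = keyword_frequency(foreground, attribute, keyword)
    return foreground_frequency / background_frequency


# Hypothetical result sets for the pair ("rugged laptop", "laptop").
background = [{"description": "Rugged field laptop"},
              {"description": "Thin office laptop"},
              {"description": "Gaming laptop"},
              {"description": "Budget laptop"}]
foreground = [{"description": "Rugged field laptop"},
              {"description": "Semi-rugged laptop"}]
ratio = textual_confidence(foreground, background, "description", "rugged")
```

Here the keyword appears in every foreground result but in only one quarter of the background results, so the ratio is 4.0.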
Additional variations of this fourth example of this second aspect relate to the application of textual keywords against the data set 12 when it is not clear which attribute 14 the textual keywords are intended to target. For example, the textual keyword may include an unusual term that does not often appear in the attributes 14 of the data entries 16 (or that does not appear often enough to identify a sufficient set of query pairs 34 for the keyword 32), or a recently added term that may be included in queries 18 but that does not yet appear often in the data set 12. In these and other scenarios, upon determining that a keyword 20 represents neither a categorical keyword (e.g., a valid value in any categorical attribute) nor a numeric keyword (e.g., a valid numeric value in any numeric attribute), an embodiment may be configured to associate the keyword 20 in the keyword map 48 with a query predicate 44 that applies a textual restriction to at least one textual attribute of the data set 12. For example, the evaluation of keywords 20 for the data set 12 in the exemplary scenarios of
As an additional variation of this fourth example of this second aspect, the evaluation of textual keywords may be facilitated by the use of a dictionary, which may identify the attributes 14 against which a particular textual keyword may appear and the query predicates 44 formulated therefor. For example, an administrator of the data set 12 may choose to identify a set of keywords 20 that have known meanings, or at least known selectivity criteria within the data set 12. These identified keywords 20 may be stored in a dictionary as dictionary keywords, along with an indication of the intended meanings. An embodiment may, while evaluating various keywords 20 according to query pairs 34, determine whether the keyword 32 has a defined meaning according to the dictionary. This definition may be included in the identification of the selectivity criteria associated with the keyword 32, and the generation of a query predicate 44 that may be stored in the keyword map 48 associated with the keyword 32. In this manner, the meanings identified by the administrator may be included in the evaluation of the keyword 32, and may be encoded in the keyword map 48 for use in translating queries 18 for application to the data set 12. In a first variation, the dictionary keyword may be associated in the dictionary with a query predicate (such as a SQL fragment) that is to be used to translate instances of the dictionary keyword identified in queries 18 to be applied to the data set 12. In a second variation, the dictionary keyword may be associated in the dictionary with one or more attributes to which the keyword 20 likely relates, and on which an embodiment is to focus while comparing the query results 24 of the foreground query 36 and the background query 38 of a query pair 34.
In order to evaluate a textual keyword, the device 126 illustrated in the exemplary scenario 120 of
A third aspect that may vary among embodiments of these techniques relates to the manner of applying the evaluative techniques presented herein to evaluate different types of keywords 20 to determine the meaning of such keywords 20. It may be appreciated that different embodiments may differently apply such evaluation techniques to the query pairs 34 for various keywords 32, and that some applications may have advantages (e.g., in accuracy, scalability, and/or computational efficiency) as compared with other applications.
More particularly, the exemplary scenario 130 of
In the exemplary scenario 130 of
While
As a second example of this third aspect, during the evaluation of the values of a particular attribute 14 for the query results 24 of various queries 18 in a query pair 34 for a keyword 32, the device 126 may be configured to invoke all of the keyword evaluators 132, and to select the query predicate 44 having the highest confidence score 46 among all invoked keyword evaluators 132. However, the invocation of each keyword evaluator 132 may be computationally costly. If a particular keyword evaluator 132 returns a particularly high confidence score 46 (reflecting a high degree of correlation), an alternative embodiment may forgo or terminate the invocation of the other keyword evaluators 132, thereby conserving computing resources and improving the performance of the evaluation.
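The early-termination alternative may be sketched as follows. The sketch assumes that each keyword evaluator is a callable returning either a (query predicate, confidence score) pair or nothing; the name `best_predicate` and the cutoff parameter `stop_at` (with its default of 0.95) are hypothetical and are not taken from the description above.

```python
from typing import Callable, Iterable, Optional, Tuple

Evaluation = Tuple[str, float]  # (query predicate, confidence score)
Evaluator = Callable[[], Optional[Evaluation]]


def best_predicate(evaluators: Iterable[Evaluator],
                   stop_at: float = 0.95) -> Optional[Evaluation]:
    """Invoke evaluators in order and keep the highest-scoring predicate.

    Stops as soon as the best score reaches `stop_at` (an assumed cutoff),
    so that the remaining, possibly costly, evaluators are never invoked.
    """
    best: Optional[Evaluation] = None
    for evaluate in evaluators:
        result = evaluate()
        if result is None:
            continue  # this evaluator found no correlation
        if best is None or result[1] > best[1]:
            best = result
        if best[1] >= stop_at:
            break  # confident enough; forgo the other evaluators
    return best
```

Setting `stop_at` above the maximum possible score reproduces the exhaustive behavior in which every keyword evaluator is invoked. The order of the evaluators matters in this sketch, since a later evaluator that might have scored higher is skipped once the cutoff is reached.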
As a third example of this third aspect, the device 126 may endeavor to populate the keyword map 48 only with query predicates 44 for which the confidence score 46 is acceptably high. For example, it may be appreciated that some keywords 32 may not have a consistent or determinable meaning, and the query results 24 of the foreground queries 36 and background queries 38 of respective query pairs 34 for the keyword 32 may differ only in arbitrary ways, leading to low confidence scores 46. This may arise, e.g., where the keyword 32 comprises a generic term, such as “computer,” which may by happenstance appear in the natural language “Description” attributes 14 for some data entries 16 but not others, thereby leading to query pairs 34 having only arbitrary differences. As a first variation of this third example, an embodiment may store the query predicate 44 and the confidence score 46 in the keyword map 48 only if the confidence score 46 is acceptably high, e.g., if the confidence score 46 exceeds a confidence score threshold. Moreover, the confidence score threshold may be adjusted relative to various factors, such as the number of query pairs 34 evaluated for the keyword 32; e.g., a somewhat lower confidence score 46 may be acceptable if resulting from the evaluation of many query pairs 34, but may not be acceptable if only a few query pairs 34 are available for the keyword 32. Additionally, it may be advantageous to normalize the confidence score 46 for the keyword 32 respective to the adjusted confidence score threshold (e.g., such that respective confidence scores 46 reflect the number of query pairs 34 evaluated in determining the confidence score 46). As a second variation of this third example, the embodiment may, upon failing to identify a query predicate 44 with an acceptably high confidence score 46, associate the keyword 32 with a default attribute, such as the “Description” attribute 14 in the data set 12 illustrated in the exemplary scenario of
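The thresholding of confidence scores and the fallback to a default attribute may be sketched as follows. The function names and every numeric value in this sketch (the base threshold, the floor, and the per-pair relief) are hypothetical, chosen only to illustrate a threshold that relaxes as more query pairs are evaluated.

```python
from typing import Dict, Optional, Union


def adjusted_threshold(num_pairs: int, base: float = 0.8,
                       floor: float = 0.6, relief_per_pair: float = 0.01) -> float:
    """Lower the confidence score threshold as more query pairs are evaluated.

    All numeric defaults are assumptions made for illustration.
    """
    return max(floor, base - relief_per_pair * num_pairs)


def keyword_map_entry(keyword: str, predicate: Optional[str], score: float,
                      num_pairs: int,
                      default_attribute: str = "Description"
                      ) -> Dict[str, Union[str, float]]:
    """Store the predicate only if its score clears the adjusted threshold;
    otherwise associate the keyword with the default attribute."""
    if predicate is not None and score > adjusted_threshold(num_pairs):
        return {"keyword": keyword, "predicate": predicate, "confidence": score}
    return {"keyword": keyword, "default_attribute": default_attribute}
```

Under these assumed values, a confidence score of 0.7 is rejected when derived from two query pairs but accepted when derived from fifty.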
A fourth aspect that may vary among embodiments of these techniques relates to the manner of translating a query 18 into a translated query 52 using the keyword map 48. As a first example, depending on the nature of the query predicates 44 stored in the keyword map 48, the translated query 52 may be generated in various ways. In a first such variation, if the query predicates 44 comprise SQL fragments, the translated query 52 may be assembled from the fragments associated with the keywords 20 of the query 18. For example, if the keyword 20 “HiTech” is associated with the query predicate 44 “brand=‘HiTech’”, and the keyword 20 “light” is associated with the query predicate 44 “weight <7.0”, then the translated query 52 may be translated from the query “light HiTech” as the following SQL query: “select * from Computers where (weight <7.0) and (brand=‘HiTech’)”.
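The assembly of a translated query from SQL fragments may be sketched as follows, using the “light HiTech” example. The `KEYWORD_MAP` dictionary and the `translate` function are hypothetical names assumed for illustration; the fragments are those of the example, with whitespace normalized. Tokens absent from the keyword map are simply ignored in this sketch, which is an assumption rather than a behavior stated above.

```python
# Hypothetical keyword map holding one SQL fragment per keyword.
KEYWORD_MAP = {
    "hitech": "brand = 'HiTech'",
    "light": "weight < 7.0",
}


def translate(query: str, table: str = "Computers") -> str:
    """Translate a keyword query into SQL by AND-ing the mapped fragments."""
    predicates = [
        KEYWORD_MAP[token.lower()]
        for token in query.split()
        if token.lower() in KEYWORD_MAP  # unmapped tokens are skipped (assumption)
    ]
    sql = f"select * from {table}"
    if predicates:
        sql += " where " + " and ".join(f"({p})" for p in predicates)
    return sql


print(translate("light HiTech"))
# select * from Computers where (weight < 7.0) and (brand = 'HiTech')
```

Because the fragments here come from a trusted keyword map rather than from user input, they are concatenated directly; an embodiment accepting untrusted values would more likely use parameterized queries.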
As a second example of this fourth aspect, an embodiment may examine the query predicates 44 to identify advantageous combinations thereof. As a first such variation, if a particular attribute 14 is targeted by two or more query predicates 44, it may be advantageous to combine these query predicates 44 in an inclusive manner. For example, a query “HiTech Pyramid laptop” may lead to the selection of query predicates 44 “brand=‘HiTech’” and “brand=‘Pyramid’”. Because no data entry 16 is likely to satisfy both query predicates 44, this query 18 is likely to fail to return any query results 24 if these query predicates 44 are combined with a logical AND connector. However, it may be inferred that the author of the query intended to query for laptop computers manufactured by either HiTech or Pyramid. Thus, an embodiment of these techniques may identify that both query predicates 44 target the same attribute 14, and may translate these query predicates 44 into the translated query 52 with a logical OR connector. As a second such variation, a query predicate 44 that targets a numeric attribute 14 may specify this query restriction in various ways, such as a numeric range (e.g., the keyword 20 “light” might be translated as the query predicate 44 “weight <7.0”.) Alternatively, such a query predicate 44 may be translated as an order, such that data entries 16 that are closer to a particular value are presented higher in the query results 24 of the query 18 than data entries 16 that are farther away from the particular value (e.g., the keyword 20 “light” might be translated as the query predicate 44 “order by [weight] asc”, thereby ordering the query results 24 in order of lowest weight to highest weight.)
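The inclusive combination of query predicates targeting the same attribute may be sketched as follows. The sketch assumes that each query predicate is available as an (attribute, SQL fragment) pair; the `combine` function and the “type” attribute used for the keyword “laptop” are hypothetical.

```python
from typing import Dict, List, Tuple


def combine(predicates: List[Tuple[str, str]]) -> str:
    """Join predicates with OR within an attribute and AND across attributes."""
    by_attribute: Dict[str, List[str]] = {}
    for attribute, fragment in predicates:
        by_attribute.setdefault(attribute, []).append(fragment)

    groups = []
    for fragments in by_attribute.values():
        if len(fragments) == 1:
            groups.append(f"({fragments[0]})")
        else:
            # Several predicates on one attribute: combine inclusively.
            groups.append("(" + " or ".join(f"({f})" for f in fragments) + ")")
    return " and ".join(groups)


where = combine([
    ("brand", "brand = 'HiTech'"),
    ("brand", "brand = 'Pyramid'"),
    ("type", "type = 'laptop'"),  # hypothetical attribute for illustration
])
print(where)
# ((brand = 'HiTech') or (brand = 'Pyramid')) and (type = 'laptop')
```

An ordering predicate such as “order by [weight] asc” would not belong in the where clause and would need to be collected separately; that handling is omitted from this sketch.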
As a third example of this fourth aspect, the identification of keywords 20 in a query 18 may be performed in various ways. As a first variation, the query 18 may simply be partitioned in various ways (e.g., by partitioning based on whitespace), and each token may be identified as a keyword 20 to be translated into the translated query 52 using the keyword map 48. While this simple technique may be advantageous where each keyword 20 comprises a single word, it may produce undesirable results for keywords 20 that involve multiple words. For example, this technique may fail to partition the query 18 “small business laptop” into the likely intended keywords 20 “small business” and “laptop” (indicating a laptop computer suitably configured for use in a small business environment), but may instead partition the query 18 into the keywords 20 “small,” “business,” and “laptop,” thereby querying the data set 12 for laptop computers that are small and have some connection with business (which may be construed as an arbitrary modifier or a stop word), leading to inaccurate search results. As a second variation, the query 18 may instead be parsed with reference to the keyword map 48, which may facilitate the partitioning of the tokens 62 of the query 18 into a set of keywords 20 having a high aggregate confidence score 66, thereby suggesting the contextual combination of tokens 62 coincident with the inferred intent of the author of the query 18. The exemplary scenario 60 of
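The partitioning of tokens with reference to the keyword map may be sketched as a small dynamic program. Several assumptions are made here that the description does not state: the aggregate confidence score is taken to be the sum of the per-keyword confidence scores, tokens absent from the map score zero, multi-word keywords are considered only if present in the map, and the confidence values shown are invented for illustration.

```python
from typing import Dict, List, Tuple


def partition(query: str, scores: Dict[str, float], max_words: int = 3) -> List[str]:
    """Split a query into keywords maximizing the summed confidence score."""
    tokens = query.lower().split()
    n = len(tokens)
    # best[i] holds (aggregate score, keywords) for the suffix tokens[i:].
    best: List[Tuple[float, List[str]]] = [(0.0, [])] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_words) + 1):
            phrase = " ".join(tokens[i:j])
            if j - i > 1 and phrase not in scores:
                continue  # multi-word keywords must exist in the keyword map
            tail_score, tail = best[j]
            candidates.append((scores.get(phrase, 0.0) + tail_score, [phrase] + tail))
        best[i] = max(candidates, key=lambda c: c[0])
    return best[0][1]


# Hypothetical confidence scores for illustration only.
SCORES = {"small": 0.3, "business": 0.1, "laptop": 0.9, "small business": 0.8}
print(partition("small business laptop", SCORES))
# ['small business', 'laptop']
```

With these assumed scores, the partition “small business” and “laptop” outscores the whitespace partition “small,” “business,” and “laptop,” which matches the intent inferred in the example above.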
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
As used in this application, the terms “component,” “module,” “system”, “interface”, and the like are generally intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
Furthermore, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. The term “article of manufacture” as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.
Although not required, embodiments are described in the general context of “computer readable instructions” being executed by one or more computing devices. Computer readable instructions may be distributed via computer readable media (discussed below). Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types. Typically, the functionality of the computer readable instructions may be combined or distributed as desired in various environments.
In other embodiments, device 162 may include additional features and/or functionality. For example, device 162 may also include additional storage (e.g., removable and/or non-removable) including, but not limited to, magnetic storage, optical storage, and the like. Such additional storage is illustrated in
The term “computer readable media” as used herein includes computer storage media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions or other data. Memory 168 and storage 170 are examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by device 162. Any such computer storage media may be part of device 162.
Device 162 may also include communication connection(s) 176 that allows device 162 to communicate with other devices. Communication connection(s) 176 may include, but is not limited to, a modem, a Network Interface Card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interfaces for connecting computing device 162 to other computing devices. Communication connection(s) 176 may include a wired connection or a wireless connection. Communication connection(s) 176 may transmit and/or receive communication media.
The term “computer readable media” may include communication media. Communication media typically embodies computer readable instructions or other data in a “modulated data signal” such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” may include a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
Device 162 may include input device(s) 174 such as keyboard, mouse, pen, voice input device, touch input device, infrared cameras, video input devices, and/or any other input device. Output device(s) 172 such as one or more displays, speakers, printers, and/or any other output device may also be included in device 162. Input device(s) 174 and output device(s) 172 may be connected to device 162 via a wired connection, wireless connection, or any combination thereof. In one embodiment, an input device or an output device from another computing device may be used as input device(s) 174 or output device(s) 172 for computing device 162.
Components of computing device 162 may be connected by various interconnects, such as a bus. Such interconnects may include a Peripheral Component Interconnect (PCI), such as PCI Express, a Universal Serial Bus (USB), FireWire (IEEE 1394), an optical bus structure, and the like. In another embodiment, components of computing device 162 may be interconnected by a network. For example, memory 168 may comprise multiple physical memory units located in different physical locations interconnected by a network.
Those skilled in the art will realize that storage devices utilized to store computer readable instructions may be distributed across a network. For example, a computing device 180 accessible via network 178 may store computer readable instructions to implement one or more embodiments provided herein. Computing device 162 may access computing device 180 and download a part or all of the computer readable instructions for execution. Alternatively, computing device 162 may download pieces of the computer readable instructions, as needed, or some instructions may be executed at computing device 162 and some at computing device 180.
Various operations of embodiments are provided herein. In one embodiment, one or more of the operations described may constitute computer readable instructions stored on one or more computer readable media, which if executed by a computing device, will cause the computing device to perform the operations described. The order in which some or all of the operations are described should not be construed as to imply that these operations are necessarily order dependent. Alternative ordering will be appreciated by one skilled in the art having the benefit of this description. Further, it will be understood that not all operations are necessarily present in each embodiment provided herein.
Moreover, the word “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word exemplary is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then “X employs A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims may generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The disclosure includes all such modifications and alterations and is limited only by the scope of the following claims. In particular regard to the various functions performed by the above described components (e.g., elements, resources, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms “includes”, “having”, “has”, “with”, or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”