1). Field of the Invention
Embodiments of this invention relate to a data processing system and method that provides improved search data.
2). Discussion of Related Art
The Internet is a global network of computer systems and has become a ubiquitous tool for finding information regarding news, businesses, events, media, etc. in specific geographic areas. A user can interact with the Internet through a user interface that is typically stored on a server computer system.
Because of the vast amounts of information available on the Internet, users often enter search queries into a search box for processing by a server computer system. The server computer system typically searches a database of information to extract information to provide for the user. Unfortunately, a large amount of information is often provided to the user which can result in the user being overwhelmed. A server computer system can provide search suggestions for refining the search space.
Some queries return too few results or irrelevant results, and it can be difficult for the user to reword the query to obtain the right results. A method that provides related search suggestions is useful in such cases.
The invention provides a method of data processing including receiving a query and utilizing the query to produce at least one related search suggestion from a data source.
The method of data processing may further include decomposing the query into at least one n-gram which is a subset of the query and processing the at least one n-gram to determine at least one related search suggestion.
The method may further include merging the at least one related search suggestion into a ranked output data set and transmitting the at least one related search suggestion.
The method may further include providing at least one n-gram that is at least a uni-gram, bi-gram, tri-gram or greater.
The method may further include processing of the at least one n-gram to identify at least one of an address, a name, an entity, a word overlap, and a stop-word.
The method may further include processing of the at least one n-gram and comparing at least one valid word from the query with at least one valid word from the n-gram to ensure quality.
The method may further include processing of the at least one n-gram and referring to a database containing data related to associations between n-grams and the at least one related search suggestion.
The method may further include merging and assigning the at least one related search suggestion a first score based on a local score, global score, number of words in the n-gram, and number of words in the query. The local score is the strength of association between n-gram and the related search suggestion. The global score is the strength of the n-gram.
The method may further include merging and assigning the at least one related search suggestion a second score measuring the special properties like entity status of the n-gram which lead to that suggestion.
The method may further include filtering the ranked output data set by comparing the at least one related search suggestion with the query and a higher ranked search suggestion having a higher second score than the at least one related search suggestion.
The method may further include filtering the ranked output data set by separating the ranked output data set into at least one of a narrow category, an expand category, and a names category.
In the method, the transmitting of the at least one related search suggestion may be without categorization.
The method may further include filtering of the at least one related search suggestion including at least one category.
In the method, the filtering may include identifying an important phrase containing an important word within the query to categorize the at least one related search suggestion.
The method may further include the important phrase or word being determined by a ratio between a query word with a lowest web frequency and a query word with a second lowest web frequency.
The method may further include processing the at least one n-gram to determine at least one data result and merging the at least one data result into a ranked output data set.
The method may also further include transmitting a final data set based on the ranked output data set.
The method may further include a data source of n-gram-webpage association generated from query-webpage association.
In the method, filtering the ranked output data set may include filtering by at least one of block list filtering, name extraction filtering, and channel type filtering.
The invention also provides a system for processing data including a server computer system, a receiving module stored on the server computer system for receiving a query over a network from a client computer system.
The system for processing data may further include a search engine that utilizes the query to extract at least one search result from a data source.
The system may further include a query decomposition module to decompose the query into at least one n-gram which is a subset of the query and a processing module to process the at least one n-gram to determine at least one related search suggestion.
The system may further include a merging module to merge the at least one related search suggestion into a ranked output data set and a transmission module to transmit the search result and the at least one related search suggestion from the server computer system to the client computer system.
The invention also provides a system that may further include a query decomposition module to decompose the query into at least one n-gram which is a subset of the query and a processing module to process the at least one n-gram to determine at least one data result.
The system may further include a merging module to merge the at least one data result into a ranked output data set and a filtering module to filter the ranked output data set to create a final data set.
The system may further include a transmission module to transmit information from the server computer system to the client computer system, the final data set being used to create the transmitted information.
The invention also provides a machine-readable storage medium that provides executable instructions which, when executed by a computer system, cause the computer system to perform a method including receiving a query.
In the machine-readable storage medium, the computer system may execute the method further including decomposing the query into at least one n-gram which is a subset of the query.
In the machine-readable storage medium, the computer system may execute the method further including processing the at least one n-gram to determine at least one related search suggestion.
In the machine-readable storage medium, the computer system may execute the method further including merging the at least one related search suggestion into a ranked output data set and transmitting the at least one related search suggestion.
The invention also provides a machine-readable storage medium that provides executable instructions which, when executed by a computer system, cause the computer system to perform a method including receiving a query.
In the machine-readable storage medium, the computer system may execute the method further including decomposing the query into at least one n-gram which is a subset of the query and processing the at least one n-gram to determine at least one data result.
In the machine-readable storage medium, the computer system may execute the method further including merging the at least one data result into a ranked output data set and transmitting a final data set based on the ranked output data set.
In the machine-readable storage medium, the computer system may execute the method further including transmitting information from the server computer system to the client computer system, the final data set being used to create the transmitted information.
The invention is further described by way of example with reference to the accompanying drawings, wherein:
The data processing system 20 is first described with respect to
A search engine 30 generating search results 32 is connected with a transmission module 34 which communicates with a plurality of client computer systems 26 over a network 52 where search results 32 can be displayed or communicated to enable user interaction with the search results 32. Search results 32 can be generated by the search engine 30 through referencing a database 36 or any data source. The data source can be any device capable of storing information. The search engine 30 is located on the server computer system 24 but can be located on a remote computer system. The search engine 30 can be of the type found in U.S. application Ser. No. 10/853,552, the contents of which are hereby incorporated by reference.
An initial query 22 is transmitted from the receiving module 28 to a related search suggestion engine 38. The related search suggestion engine 38 contains a query decomposition module 40, a processing module 42, a merging module 44, and a filtering module 46. The merging module 44 creates a ranked output data set 48 which is received by the filtering module 46 and results in a final data set 50. The final data set 50 is received by the transmission module 34 and is transmitted to a client computer system 26 from the server computer system 24. The query 22 can be processed through the search engine 30 and related search suggestion engine 38 simultaneously or in sequence, one after the other. Also, the transmission module 34 may transmit search results 32 and the final data set 50 simultaneously or in a staggered manner through a network 52 to a client computer system 26.
The database 36 is in communication with both the search engine 30 and the processing module 42. It is appreciated that the database 36 can be multiple data sources located on the server computer system 24 or at a remote location.
The related search suggestion engine 38 receives the initial query 22 and decomposes the query 22 into its “components” called n-grams 56 or constituent terms. The n-grams 56 are processed in a processing step 58 by the processing module 42.
The n-grams 56 are processed 58 into valid n-grams 60 and invalid n-grams 62. The valid n-grams 60 generate related search suggestions 64 (RSS). A related search suggestion 64 is defined as text that is produced and presented to a user so that when the user clicks on the text, a query is processed by a search engine to produce search results. Multiple related search suggestions 64 are generated for each valid n-gram 60; however, it is also possible to generate only one search suggestion 64 per valid n-gram 60. The related search suggestions 64 are merged in a merging process 66 by a merging module 44. The merging process 66 results in a ranked output data set 48 which is filtered through a filtering process 68 by the filtering module 46. The filtering process 68 results in a final data set 50. Thus, the final data set 50 is received by the client computer system 26.
When a search suggestion 64 is selected by a user or client computer system 26, specific information related to the user selection is sent to the database 36. The specific information can contain data concerning which search suggestion the user selected and what n-grams 56 (of the initial query 22) are associated with that selection. Other specific information can be sent to the database 36, such as number of words in the n-gram 56, number of words in the initial query 22, and number of suggestions needed.
In use,
After the initial filtering process 72, the initial query 22 or modified query (if a spelling correction etc. has occurred) can be decomposed into a series of n-grams 56 or constituent terms in a decomposition process 74. Each n-gram 56, according to an embodiment, can be a unigram 76, a bi-gram 78, or a tri-gram 80. However, it is possible to create n-grams 56 containing up to the number of words in an initial/modified query 22. N-grams 56 are a subset of the initial query 22.
The bi-grams 78 and tri-grams 80, according to an embodiment, require all words in the n-gram to be directly adjacent to one another to form the n-gram 56 and are filtered to exclude certain prefixes or stop-words. However, it would be possible to create n-grams 56 by skipping words. For example, referring to
Components or n-grams 56 can contain any or all of the initial query 22 terms, and may optionally be altered for spelling, punctuation, stemming, capitalization, rephrasing, and other standard-text processing manipulations.
The above decomposition is performed by the query decomposition module 40 although it is appreciated that the decomposition can occur in separate modules.
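By way of illustration only, the adjacent-word decomposition described above can be sketched as follows. The function name `decompose` and the parameter `max_n` are hypothetical and do not appear in the description; the sketch assumes whitespace-separated words and omits the prefix and stop-word filtering.

```python
def decompose(query, max_n=3):
    """Return contiguous n-grams (as word tuples) of length 1..max_n.

    Hypothetical helper: only directly adjacent words are combined,
    so every n-gram is a subset of the query.
    """
    words = query.split()
    ngrams = []
    for n in range(1, max_n + 1):
        # Slide a window of n adjacent words across the query.
        for start in range(len(words) - n + 1):
            ngrams.append(tuple(words[start:start + n]))
    return ngrams


if __name__ == "__main__":
    for gram in decompose("new jersey state flag"):
        print(" ".join(gram))
```

Raising `max_n` to the number of words in the query would yield n-grams up to the full query length, as the description permits.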
Also, n-grams 56 that are prefix phrases are eliminated, such as in a query 22 containing the words, “Where can I find . . . ”. A prefix list of phrases is provided to filter excessive words that may dilute the effectiveness of finding a search suggestion. Unigram 76 numbers can be eliminated from the processing step 58. For example, the n-gram “100 years” would require the n-gram “100” to be eliminated. The preceding examples are included only for illustration; the inclusion or exclusion of specific n-grams can be controlled by modifying configuration files to allow customized behavior for different applications.
Names are generally defined as proper nouns associated with a person and are identified by a “Names list” or data set. The Names list could also be expanded to include names of places and things as well as persons. Entities are defined on an “Entities list” or data set and include non-name words having special significance or meaning. Entities having special significance will be given a weighted score, as will be later described in more detail. Entities can also include words with no special significance but having highly common group occurrences. For instance, the word “Acura Legend” would be considered an entity, with a weighted score, since it has special significance to a specific type of car. However, the words “abnormal growth” would be considered an entity as well, even though it has no special significance. The words “abnormal” and “growth” have a highly common group occurrence and therefore are considered an entity by association. However, entities with no special significance, such as “abnormal growth”, are not weighted in the scoring of suggestions, as will be later described. In another embodiment, names and entities can be identified algorithmically using entity extraction algorithms well known in the art, or by a combination of algorithms and lists.
If an n-gram 56 has a word overlap with another larger n-gram 56 which is an entity or name, the n-gram 56 will be eliminated. Any n-grams 56 that split apart names or entities are eliminated.
An example of n-gram 56 overlapping with a larger n-gram 56 that is a name or entity would be a query 22 containing the bi-gram “Britney Spears”. The unigram “Spears” is related to a certain type of weapon. The name “Britney Spears” occurs on the “Names list” because she is recognized as a famous pop singer. Because the unigram “Spears” has word overlap with the larger bi-gram “Britney Spears”, “Spears” is identified as being an invalid n-gram 62 and is not used to obtain related search suggestions 64. The above example illustrates one way in which valid n-grams 60 are distinguished from invalid n-grams 62.
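By way of illustration only, the overlap rule can be sketched over word positions. The function `valid_ngrams` and its parameters are hypothetical; the sketch assumes a lower-cased Names/Entities set and treats an n-gram as invalid when it shares a word position with a name or entity without containing the whole name or entity. Conflicts between overlapping entities are not handled here.

```python
def valid_ngrams(words, names_and_entities, max_n=3):
    """Return n-grams that do not split or partially overlap a name/entity."""
    spans = [(s, s + n)
             for n in range(1, max_n + 1)
             for s in range(len(words) - n + 1)]
    # Spans whose text appears on the (assumed) Names/Entities list.
    protected = [(s, e) for (s, e) in spans
                 if " ".join(words[s:e]).lower() in names_and_entities]

    def breaks(span, entity):
        overlaps = span[0] < entity[1] and entity[0] < span[1]
        contains = span[0] <= entity[0] and entity[1] <= span[1]
        return overlaps and not contains

    return [" ".join(words[s:e]) for (s, e) in spans
            if not any(breaks((s, e), ent) for ent in protected)]


if __name__ == "__main__":
    print(valid_ngrams("britney spears tour".split(), {"britney spears"}))
```

In this sketch the unigrams “britney” and “spears” and the bi-gram “spears tour” are dropped, while “britney spears” itself survives.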
Word overlap with another n-gram, that is an entity or name, can be determined, according to an embodiment, through implementing the following logic:
Consider a query: X0 X1 . . . X(N−1)
First, dummy words A, B and C, D are padded before and after the query, respectively, to form:
A B X0 X1 . . . X(N−1) C D
The various n-grams 56 needed for evaluation from the query are:
However, the n-grams can be written in a regular pattern as follows:
The n-grams containing dummy words are not going to be used as valid n-grams 60. However, the following pattern emerges:
If an n-gram is a dummy, it cannot be an entity or name. The dummy n-grams are needed so that invalid values are not returned for any of the indices mentioned in e)-f) for n-grams 0, 1, 3 and any n-gram above number of words*3−1.
Another type of n-gram 56 that is analyzed in the splitting process 84 is an address suffix n-gram. Address suffixes, such as “Ave., Pl., Ct., St., Rd., etc.” can be provided on a list or data set for identification in the splitting process 84. An address suffix n-gram, according to an embodiment of the invention, is eliminated if it is recognized as an ambiguous search within the context of the query 22. For example, if a street suffix is present in the query 22 as follows, “V W X Y Z <suffix> M N”, then the following n-gram 56 combinations would be eliminated because street names would get separated from city-state combinations leading to ambiguity in results.
Ambiguous n-gram 56 combinations to be invalidated, involving address suffixes, can be stored in a data set or list for reference during the splitting process 84. Also, ambiguous n-gram combinations having an address suffix and a direction n-gram, such as North, N, East, E etc., can be eliminated by reference to a data set or list. For example, referring to the same example query, “V W X Y Z <suffix> M N”, if X is a direction n-gram, then the following n-gram 56 combinations are eliminated as invalid:
Similarly, using the same example query above, if Y is a direction n-gram, the following known ambiguous combinations would be eliminated or invalidated:
It is appreciated that the same type of ambiguous n-gram combination filtering can be applied beyond street suffixes in other contexts.
N-grams 56 recognized as cities, states, or street names, when compared with a city, state, or street name list, can also be analyzed for valid 60 or invalid n-grams 62. If a city and state n-gram is greater than three words, in an embodiment of the invention, the city and state are split into a combination of unigrams 76, bi-grams 78, and tri-grams 80.
However, if an n-gram 56 is recognized as a city and the adjacent n-gram 56 is recognized as a state, and the combined city and state n-gram is less than three words (a tri-gram 80 or less), the city and state n-gram is not split and is marked as an address entity. If the address entity is not part of a larger entity it will become a valid n-gram 60 and will not be eliminated. Therefore, city and state n-gram combinations less than three words may survive the splitting process 84 and can become valid n-grams 60 which generate search suggestions.
Also, street names would not be separated from city names if they occur adjacent to one another in a query 22 within the tri-gram 80 limit. Splitting the street name from the city name would return erratic search suggestions containing a similar street name in an entirely unrelated city. Therefore, maintaining the n-gram containing the street and city is advantageous because it tends to provide more relevant search suggestions.
Address and Name/Entity Conflict
A situation can occur where the address rules and the Names and Entities lists conflict. Conflicts may occur when an address rule determines an n-gram 56 is invalid 62 but the Entity or Names list determines the n-gram 56 is a valid n-gram 60. Naturally, a conflict may also occur when an address rule determines an n-gram 56 is valid 60 but the Entities or Names list determines the n-gram 56 is invalid 62. The general rule applied in these situations is that entities cannot break higher entities which can be defined by the processing module 42. For example, the query 22 “fred thomas edison new jersey” can be parsed into three n-gram 56 combinations:
If there is a conflict between address entities and name entities, according to an embodiment, both entities will survive and neither will be eliminated. Therefore, “fred thomas edison” will not be eliminated and “edison new jersey” will not be eliminated even though there is a conflict between the two n-grams.
However, the address rules, according to another embodiment, can allow one type of entity to be dominant over another. Address entities can be made to take precedence over the Names and Entities list so that the association between “thomas” and “edison” will be broken, therefore resulting in the first n-gram 56 combination (listed above) being selected as containing the correct valid n-grams 60. It should be noted that “fred thomas edison” occurs on the Names list but was in conflict with the higher address entity of “edison new jersey”. Because “edison new jersey” can be considered a higher entity, it takes precedence over the Names and Entities list. It is appreciated that, in another embodiment, the Names and Entities list could be defined as a higher entity in the processing module 42 and therefore take priority over address entities. Upon determining all invalid n-grams 62, the remaining valid n-grams 60 can be established in the process 86.
Stop-Word Checking
With respect to a bi-gram 78, if a stop-word is within the valid bi-gram 78, any tri-grams 80 containing the bi-gram 78 must be checked for data. Suppose there is a query 22 containing the elements ABCD. If a valid bi-gram (BC) exists where C is the non-stop-word, then B must be checked to determine whether it is a stop-word. If B is a stop-word, then any tri-grams 80 containing BC must be examined to determine if the tri-gram 80 contains valid data. The tri-grams 80 to be examined in this example are ABC and BCD because they are tri-grams 80 containing the bi-gram BC. If either tri-gram 80 contains related search suggestion data 90 and is a valid tri-gram 80, then the data associated with the bi-gram BC will not be used. The above processing assumes that tri-grams 80 would have higher resolution in finding relevant data and provides the advantage of returning more relevant search suggestions.
For example, suppose a query 22 is entered containing, “if the car is black then”. Suppose that “is black” is identified as a valid bi-gram 78. Assume “black” is a non-stop-word and “is” is identified as a stop-word. Therefore, the tri-grams “car is black” and “is black then” are examined to determine if they contain data. If the tri-grams do contain related search suggestion data 90, such data will be preferred over other data associated with the bi-gram “is black”. Essentially, this processing implements a reverse logic, in that the existence of search suggestion data 90 must be determined to decide which n-grams are valid.
With respect to a valid unigram 76, if a stop-word is adjacent to the unigram 76 (either preceding or succeeding), then the bi-grams 78 containing the stop-word and unigram 76 will be checked for data. For example, suppose there is a query 22 containing the elements BCD. If a valid unigram C exists, then B and D must be evaluated to determine whether they are stop-words because they precede and succeed the unigram C, respectively. If B is a stop-word, then the bi-gram BC will be examined to determine if it contains related search suggestion data 90. If D is a stop-word, then the bi-gram CD will be examined to determine if it contains related search suggestion data 90. If either bi-gram, BC or CD, contains data, then that bi-gram 78 is valid and the relevant search suggestion data 90 will be selected over the unigram, C.
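By way of illustration only, the stop-word checks for unigrams and bi-grams can be sketched as below. The function `longer_candidates` and the callable `has_data` are hypothetical stand-ins for a lookup of related search suggestion data 90; the sketch checks only whether data exists and does not re-validate the longer n-gram against the other rules.

```python
def longer_candidates(words, start, end, stop_words, has_data):
    """Return longer n-grams with data that should be preferred over words[start:end]."""
    candidates = []
    length = end - start
    if length == 1:
        # Unigram: look at a preceding and a succeeding stop-word.
        if start > 0 and words[start - 1] in stop_words:
            candidates.append((start - 1, end))
        if end < len(words) and words[end] in stop_words:
            candidates.append((start, end + 1))
    elif (length == 2 and words[start] in stop_words
          and words[start + 1] not in stop_words):
        # Bi-gram led by a stop-word: examine both tri-grams containing it.
        if start > 0:
            candidates.append((start - 1, end))
        if end < len(words):
            candidates.append((start, end + 1))
    grams = [" ".join(words[s:e]) for (s, e) in candidates]
    return [g for g in grams if has_data(g)]


if __name__ == "__main__":
    words = "if the car is black then".split()
    stops = {"if", "the", "is", "then"}
    data = {"car is black": ["black cars"]}  # hypothetical suggestion data
    print(longer_candidates(words, 3, 5, stops, lambda g: g in data))
```

An empty return value means that the original unigram or bi-gram remains the source of suggestion data.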
Essentially, for every valid unigram 76 or bi-gram 78, the n-grams 56 containing the valid unigram 76 or bi-gram 78 must be checked for data and will be preferred if data exists. The process of stop-word checking described above can occur in the splitting process 84 according to an embodiment. It is appreciated that the stop-word checking process can occur in a separate process as well. Furthermore, a list of dependent n-grams (resulting from stop-word checking) can be compiled to determine what n-grams should be used in creating related search suggestions 64. In an example, according to an embodiment, stop-word checking can be accomplished by the following logic:
For entities, names, the address rule, and the stop word rule, if a longer valid n-gram 60 contains any search suggestion data 90, the shorter n-gram within the longer n-gram 60 will be eliminated as a source of search suggestion data 90. Generally, longer n-grams are more likely to be rare queries and often contain less data than shorter non-rare n-grams. Shorter n-grams tend to be more popular queries and may return large amounts of irrelevant data.
In an example, according to an embodiment, initial query comparison 94 can be accomplished by the following logic:
Intra-session scoring can also be applied to the indexing of n-grams 60 to suggestion data 90. In intra-session scoring, queries further away from the original query in a session are weighted lower. Also, instead of keeping the raw form of data from the sessions for related queries, the query can be normalized and hashed and kept in that form. A separate mapping from hash to raw form can be maintained.
The above equation calculates an individual score for each n-gram using a local score, which is a number representative of how many users asked a suggestion query in a session with queries containing a specific n-gram. The global score is based on the n-gram itself. The global score represents the number of users asking all the queries that gave rise to an n-gram. The product of the individual Score[suggestion] values for the n-grams creates a total score for the suggestion as a whole.
The local and global scoring can be defined, in an embodiment, according to the following logic:
If an n-gram is too popular, the result of Score[suggestion] is a larger score which is less desired in the above equation. The local-to-global ratio is adjusted by being multiplied with a second ratio equal to the number of words in an n-gram divided by the number of words in the initial query 22.
Based on the above Score[suggestion] equation, a lower Score[suggestion] ratio indicates a highly desired score. The following score is used in merging the suggestions for all valid n-grams 60 to form a ranked output data set 48:
The above equation includes the weighted scores for entities, as previously described. The equation is defined by the variables e and n. The variable e represents a score related to the number of entities and name n-grams from the initial query 22 which contributed to the suggestion being scored. The variable n represents the total number of n-grams from the initial query 22. The expression
gives weight to the suggestions that came from entities or names as defined on the Entities and Names list. The scoring evaluates the entity or name contributions. It should be noted that the Actual_ratio value is calculated by subtracting Score[suggestion] from a value of one. Therefore, a higher Actual_ratio value is more desired and indicates a higher ranked suggestion. However, as previously mentioned, entities with no special significance having highly common group occurrences (such as “abnormal growth”) are not considered in the above scoring equation and are not given weight.
If there is a tie in scoring between two suggestions using the Actual_ratio score, a tie breaker between two Actual_ratio scores is determined by the equation:
Tie_breaker = 1 − Product_over_all_ngrams(Score[suggestion])
The tie breaker equation utilizes the Score[suggestion] value subtracted from a value of one, so that a higher tie breaker score is desired in winning a tie breaker. It should be noted that the Score[suggestion] value excludes any contributions from entities or names as described above and is based purely on the local score, global score, and number of words in the query 22 and n-gram. If a query is an entity,
is zero, hence all suggestions get an actual ratio score of 1, which is not useful. Therefore a tiebreaker is needed. Thus, the possibility of having a tie within the Score[suggestion] value is less likely than having a tie within the Actual_ratio score.
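By way of illustration only, the tie breaker can be sketched as below. The function `tie_breaker` follows the tie breaker equation given above. The function `ngram_score` is an assumption: the Score[suggestion] equation is not reproduced in this text, and the form used here (one minus the local-to-global ratio multiplied by the word-count ratio) is merely one form consistent with the statements that a lower Score[suggestion] is more desired and that an overly popular n-gram yields a larger score.

```python
from math import prod


def ngram_score(local, global_, ngram_words, query_words):
    """ASSUMED per-n-gram Score[suggestion]; lower is better.

    The exact equation is not given in the text; this form is hypothetical.
    """
    return 1.0 - (local / global_) * (ngram_words / query_words)


def tie_breaker(scores):
    """Tie_breaker = 1 - product of Score[suggestion] over all n-grams."""
    return 1.0 - prod(scores)


if __name__ == "__main__":
    scores = [ngram_score(1, 4, 2, 4), ngram_score(3, 4, 1, 4)]
    print(tie_breaker(scores))
```

Under this assumed form, each n-gram that supports a suggestion lowers the product, so a suggestion supported by several n-grams receives a higher tie breaker value.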
The ranked output data set 48 is received by the filtering module 46. The filtering module 46 filters the ranked output data set 48 in a suggestion filtering process 104 and outputs a final data set 50.
A name extraction enhancement process is possible by extracting names from related search suggestion data 90 and adding the names to the Related Names category as related search suggestions 64. A related search suggestion 64 would receive a final ranking score, i. Names that are derived from related search suggestions 64 get the same score as the original suggestion. The score can be additive if other suggestions give rise to that name or the name suggestion already exists: if the name comes from multiple suggestions or itself, the scores are added up and the suggestions are re-sorted. It is possible to extract one-word names or to block one-word names from being extracted.
It should be noted that edit distance can also be used as a factor in determining overlap between suggestions. The above information is utilized to calculate an overlap score between 0 and 1. The resulting overlap score can be calculated, in an embodiment, according to the following logic:
If a related search suggestion 64 has a maximum overlap greater than 0.9 with another suggestion or initial query 22, it is eliminated because it is too similar to the maximum overlap partner. Also, if the related search suggestion 64 has a synonym in common with the maximum overlap partner and the maximum overlap is greater than 0.45 (0.9/2), the related search suggestion 64 is eliminated.
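By way of illustration only, the threshold test can be sketched as below. The overlap measure used here is a plain word-set (Jaccard) overlap, which is a stand-in and not the overlap logic of the embodiment; `filter_similar` and `share_synonym` are hypothetical names, and comparing each suggestion only against the query and higher-ranked surviving suggestions is an assumption.

```python
def word_overlap(a, b):
    """Stand-in overlap score in [0, 1]: shared words over all words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def filter_similar(query, ranked, share_synonym=lambda a, b: False, limit=0.9):
    """Drop suggestions too similar to the query or to a higher-ranked survivor."""
    kept = []
    for suggestion in ranked:
        partners = [query] + kept
        # The maximum overlap partner is the most similar earlier text.
        partner = max(partners, key=lambda p: word_overlap(suggestion, p))
        score = word_overlap(suggestion, partner)
        if score > limit:
            continue
        if score > limit / 2 and share_synonym(suggestion, partner):
            continue
        kept.append(suggestion)
    return kept


if __name__ == "__main__":
    ranked = ["new jersey flag", "new jersey bird", "flag jersey new", "texas bird"]
    print(filter_similar("new jersey flag", ranked))
```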
During the unique word tracking and filtering process 112, unique words are tracked and stored in a location to be referenced to ensure that queries contain unique words. Unique words are defined as words that are not stop-words. In the following filtering process 114, a word novelty filter eliminates suggestions that do not have a unique word. For example, suppose there are four suggestions, A, B, C, and D, ranked in order from one to four, respectively. The word novelty filtering process 114 would ensure that suggestion D contains a unique word that does not occur in suggestions A, B, and C. If suggestion D does not contain a unique word (compared to A, B, and C), it is eliminated.
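By way of illustration only, the word novelty filter can be sketched as below. The function name `novelty_filter` is hypothetical; the sketch assumes whitespace-separated, case-insensitive words and a caller-supplied stop-word set.

```python
def novelty_filter(ranked, stop_words):
    """Keep a suggestion only if it adds a non-stop-word not seen in earlier ones."""
    seen = set()
    kept = []
    for suggestion in ranked:
        words = {w for w in suggestion.lower().split() if w not in stop_words}
        if words - seen:
            kept.append(suggestion)
            seen |= words  # track unique words for later comparisons
    return kept


if __name__ == "__main__":
    ranked = ["new jersey bird", "new jersey flower",
              "the jersey bird", "state of new jersey"]
    print(novelty_filter(ranked, {"the", "of"}))
```

In this example the third suggestion is eliminated because “jersey” and “bird” already occur in higher-ranked suggestions.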
The Narrow category 118 provides the user with the related search suggestions 64 similar to the initial query 22. A suggestion located in the Narrow category 118 can be referred to as a “SIM”. The Expand category 120 enables the user to search alternative queries that may provide desired results beyond the scope of the initial query 22. A suggestion located in the Expand category 120 can be referred to as an “ALT”. It is understood that multiple categories beyond Narrow, Expand, and Names categories can be created related to the n-gram.
For example, suppose a query 22 was entered such as “Where can I find information on Britney Spears and Tom Cruise?”. Because there is more than one name or entity (2 names) within the query 22, the important word must be determined through an n-gram comparison with suggestions existing in the Narrow category 118. If the name “Britney Spears” occurs in the Narrow category 118 three times, and the name “Tom Cruise” only occurs once, then “Britney Spears” will be flagged as the parsing query where the important word can be found.
However, if no data exists in the Narrow category 118, the next process 134 selects the name or entity n-grams as the parsing query. Therefore, in our example, “Britney Spears” and “Tom Cruise” would have been selected as the parsing query to find the important word because both n-grams likely occur on the Names list.
However, if “Britney Spears” and “Tom Cruise” are not found on the Names list or in the Narrow category, then the entire query 22 must be selected 138 as a parsing query for further processing.
After a parsing query is selected 132, 136, 138 for processing, the web frequencies of all words within the parsing query are determined. The lowest (W1) and second lowest (W2) web frequency words are then determined 140. The lowest, W1, and second lowest, W2, web frequency words are compared 142 in a frequency ratio against a predetermined threshold (t):
The predetermined threshold t can be any number defined by the filtering module 46, such as the number four, for example. The variable w1 is the web frequency of the lowest web frequency word, W1, and the variable w2 is the web frequency of the second lowest web frequency word, W2. The frequency ratio (w1/w2) looks to determine if w1 and w2 are within the same order of magnitude. If the frequency ratio is below the predetermined threshold t, then the two words, W1 and W2, are within an order of magnitude and therefore the local frequency of each word must be determined 144. W1 or W2 is selected as the important word by comparing each word's local frequency in suggestion data. The most dominant word prevails which is defined as the word having the highest local frequency within a local suggestion set. The local frequency is the number of suggestions a word occurs in, within a local suggestion set.
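A minimal sketch of the important word selection follows. It takes the frequency ratio literally as w1/w2, as written above. The passage does not state what happens when the ratio is not below t, so returning W1 in that case is an assumption, as is preferring W1 when the local frequencies tie. The function names and the frequency values are hypothetical, and web frequencies are assumed to be positive.

```python
def local_frequency(word, suggestions):
    """Number of suggestions in the local suggestion set containing the word."""
    word = word.lower()
    return sum(word in s.lower().split() for s in suggestions)


def pick_important_word(parsing_query, web_freq, suggestions, t=4.0):
    """Select the important word of a parsing query.

    web_freq maps each lowercased word to a positive web frequency.
    """
    words = list(dict.fromkeys(parsing_query.lower().split()))  # de-duplicate
    if not words:
        raise ValueError("parsing query is empty")
    if len(words) == 1:
        return words[0]

    # Lowest (W1) and second lowest (W2) web frequency words.
    ranked = sorted(words, key=lambda w: web_freq[w])
    W1, W2 = ranked[0], ranked[1]
    w1, w2 = web_freq[W1], web_freq[W2]

    if w1 / w2 < t:
        # Same order of magnitude: the word with the higher local
        # frequency prevails. Ties go to W1 (assumption).
        if local_frequency(W2, suggestions) > local_frequency(W1, suggestions):
            return W2
        return W1

    # Not described in this passage; choosing W1 here is an assumption.
    return W1


# Made-up web frequencies, for illustration only.
web_freq = {"digital": 500, "camera": 400, "reviews": 900}
suggestions = ["digital camera", "digital photo frame", "camera bag", "digital zoom"]
important = pick_important_word("digital camera reviews", web_freq, suggestions)
```

Here "camera" is W1 and "digital" is W2, and "digital" is selected because it occurs in three suggestions while "camera" occurs in two.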
Once an important word is determined, all n-grams 56 within the initial query 22 containing that word are determined 148 and thus become important phrases, as shown in the process step 150. After the important words and phrases are determined, suggestions containing the important word or phrase will be categorized 152 as SIM in the Narrow category 118.
For example, suppose the initial query 22, “New Jersey State Flag”, is entered. “New Jersey” occurs in the Narrow category 118 already, in the form of suggestions such as “New Jersey Bird” or “New Jersey Flower”. Therefore, the parsing query chosen is “New Jersey” because it has overlap with the other suggestions in the Narrow category 118. The n-grams with the highest occurrence in Narrow are selected as the parsing query. Therefore, “New Jersey” is selected as the n-gram with the highest occurrence since “New Jersey Bird” and “New Jersey Flower” contain the n-gram “New Jersey”. Then the lowest and second lowest web frequency words are determined within the parsing query. “Jersey” has the lowest web frequency because the word “New” is so common it could be considered a stop-word. Therefore, “Jersey” becomes the important word. Thus, the phrases in the initial query 22 containing the important word would be categorized as important phrases. The initial query 22 “New Jersey State Flag” can be broken into three n-grams: 1) “New Jersey” 2) “State Flag” and 3) “New Jersey State Flag”.
Because options 1) and 3) contain the important word “Jersey”, “New Jersey” and “New Jersey State Flag” become important phrases. Therefore, any related search suggestions 64 containing an important word or phrase become categorized 152 in the Narrow category 118 as a SIM.
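The “New Jersey State Flag” example can be traced with a short sketch. The function names are hypothetical, the n-grams are passed in as listed in the example rather than generated, and labeling every suggestion that is not a SIM as an ALT is an assumption made only to keep the sketch complete.

```python
def important_phrases(ngrams, important_word):
    """Return the n-grams of the initial query that contain the important word."""
    word = important_word.lower()
    return [g for g in ngrams if word in g.lower().split()]


def categorize(suggestions, important_word, phrases):
    """Label each suggestion SIM (Narrow) if it contains the important word
    or an important phrase; otherwise label it ALT (assumption)."""
    word = important_word.lower()
    labels = {}
    for s in suggestions:
        lowered = s.lower()
        has_word = word in lowered.split()
        has_phrase = any(p.lower() in lowered for p in phrases)
        labels[s] = "SIM" if has_word or has_phrase else "ALT"
    return labels


ngrams = ["New Jersey", "State Flag", "New Jersey State Flag"]
phrases = important_phrases(ngrams, "Jersey")
labels = categorize(["New Jersey Bird", "Texas State Flag"], "Jersey", phrases)
```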
Also, a noise elimination process 156 will eliminate ALT suggestions that are considered “noise” because they are too popular. The “noise” words can be maintained on a list for reference by the noise elimination process 156.
After the bad pattern filter process 164, a block list filtering and channel filtering process 165 can be implemented. A block list can eliminate all related search suggestions 64, eliminate certain suggestions, or replace suggestions with a replacement search suggestion. The block list is loaded by the server computer system 24, which handles the general processing and can find a replacement search suggestion to modify the final data set 50. The block list can be manually created, according to an embodiment of the invention, or the block list may be automatically generated.
Channel filtering is possible by identifying whether a channel is a clean channel or an adult channel in determining what related search suggestions 64 should be modified. For example, if a channel is identified as a clean channel, related search suggestions 64 containing adult content will be invalid. However, if a channel is identified as an adult channel, all suggestions are to be used. It is also possible to apply channel filtering to an image channel.
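The block list and channel filtering behavior can be sketched as follows. The data layout is an assumption: the block list is modeled as a dictionary mapping a lowercased suggestion to either None (eliminate) or a replacement, a flag eliminates all suggestions, and adult content is detected by a caller-supplied predicate, since this description does not say how such content is identified.

```python
def apply_block_list(suggestions, block_list, block_all=False):
    """Eliminate all suggestions, eliminate certain suggestions, or
    substitute a replacement search suggestion."""
    if block_all:
        return []
    result = []
    for s in suggestions:
        key = s.lower()
        if key not in block_list:
            result.append(s)                  # not blocked
        elif block_list[key] is not None:
            result.append(block_list[key])    # replaced
        # a None entry eliminates the suggestion
    return result


def channel_filter(suggestions, channel, is_adult):
    """Drop adult suggestions on a clean channel; keep all on an adult channel."""
    if channel == "adult":
        return list(suggestions)
    if channel == "clean":
        return [s for s in suggestions if not is_adult(s)]
    raise ValueError(f"unknown channel: {channel}")


# Hypothetical block list and adult-content predicate.
block_list = {"bad suggestion": None, "old name": "new name"}
after_block = apply_block_list(["Good one", "Bad suggestion", "Old name"], block_list)


def is_adult(suggestion):
    return "adult" in suggestion.lower()


clean = channel_filter(["a", "adult b"], "clean", is_adult)
adult = channel_filter(["a", "adult b"], "adult", is_adult)
```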
After the above suggestion filtering process 104 is complete, a final data set 50 of related search suggestions is created and sent to the client computer system 26.
The server computer system 24 has stored thereon a crawler 176, a collected data store 178, an indexer 180, a plurality of search databases 36, a plurality of structured databases and data sources 222, a search engine 30, a search suggestion engine 38, and the user interface 170. The novelty of the present invention revolves around the user interface 170, the search engine 30, the search suggestion engine 38, and one or more of the structured databases and data sources 222. The crawler 176 is connected over the internet 172A to the remote sites 174. The collected data store 178 is connected to the crawler 176, and the indexer 180 is connected to the collected data store 178. The search databases 36 are connected to the indexer 180. The search engine 30 and search suggestion engine 38 are connected to the search databases 36 and the structured databases and data sources 222. The client computer systems 26 are located at respective client sites and are connected over the internet 172B and the user interface 170 to the search engine 30 and search suggestion engine 38.
A user at one of the client computer systems 26 accesses the user interface 170 over the internet 172B (step 188). The user can enter a search query in a search box in the user interface 170, and either hit “Enter” on a keyboard or select a “Search” button or a “Go” button of the user interface 170 (step 190). The search engine 30 then uses the “Search” query to parse the search databases 36 or the structured databases or data sources 222. In the example where a “Web” search is conducted, the search engine 30 and suggestion engine 38 parse the search database 36 having general Internet Web data (step 192). Various technologies exist for comparing or using a search query to extract data from databases, as will be understood by a person skilled in the art.
The search engine 30 and suggestion engine 38 then transmit the extracted data over the internet 172B to the client computer system 26 (step 194). The extracted data includes URL links to one or more of the remote sites 174. The user at the client computer system 26 can select one of the links to the remote sites 174 and access the respective remote site 174 over the internet 172C (step 196). The server computer system 24 has thus assisted the user at the respective client computer system 26 to find or select one of the remote sites 174 that has data pertaining to the query entered by the user.
The exemplary client computer system 26 includes a processor 198 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 200 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), and a static memory 202 (e.g., flash memory, static random access memory (SRAM), etc.), which communicate with each other via a bus 204.
The client computer system 26 may further include a video display 206 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). The client computer system 26 also includes an alpha-numeric input device 208 (e.g., a keyboard), a cursor control device 210 (e.g., a mouse), a disk drive unit 212, a signal generation device 214 (e.g., a speaker), and a network interface device 216.
The disk drive unit 212 includes a machine-readable medium 218 on which is stored one or more sets of instructions 220 (e.g., software) embodying any one or more of the methodologies or functions described herein. The software may also reside, completely or at least partially, within the main memory 200 and/or within the processor 198 during execution thereof by the client computer system 26, the memory 200 and the processor 198 also constituting machine readable media. The software may further be transmitted or received over a network 154 via the network interface device 216.
While the instructions 220 are shown in an exemplary embodiment to be on a single medium, the term “machine readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database or data source and/or associated caches and servers) that store the one or more sets of instructions. The term “machine readable medium” shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention. The term “machine readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media.
One advantage of the above data processing method 54 and system 20 is that related search suggestions 64 can be offered for new or rare queries. New or rare queries may have less reliable search results and the related search suggestions 64 can create a safer fallback option.
Another advantage is that suggestion coverage may increase dramatically over current methods. A significant share of the search engine page views can be attributed to clicks on related search suggestions 64, so increased coverage should increase page views.
In addition to increased coverage of queries, this method also increases the average number of suggestions per query, applicable to both rare and non-rare queries. The related search suggestions 64 can drive traffic from non-monetized to monetized queries more easily using the above query decomposition method.
An alternative embodiment could apply the above query decomposition method in a general search result context. For instance, search results from a search engine can be processed in the same manner the related search suggestions 64 were processed. The scoring scheme described herein could be applied to query decomposition of search results.
In another alternative embodiment, the query decomposition method can be applied to any query-based system, for example to create a classification for queries in a system. Other kinds of affinity, such as user-to-user affinity or pick-to-pick relationships, can also be measured using the query decomposition method above. Specifically, common query components could be measured. Moreover, a correlation between all queries and picks in a session could be created using the above decomposition method.
In another alternative embodiment, the data processing method 54 can be accomplished without a filtering step 104. The ranked output data set 102 could be transmitted directly to the client computer system 26 without filtering. Moreover, filtering could occur on the client computer system 26 instead of the server computer system 24. Furthermore, different filtering methods and criteria may be applied to different types of suggestions while remaining within the scope of this invention. For instance, more stringent filters may be applied to the Narrow category 118 than the Expand category 120. Also, the data processing method 54 can create only a Narrow category of suggestions while excluding the Names category 166 and the Expand category 120. Many variations in the types of categories to be displayed to the user are possible. For example, a display of search suggestions without any category is possible. In another example, a display of at least one category is possible.
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative and not restrictive of the current invention, and that this invention is not restricted to the specific constructions and arrangements shown and described since modifications may occur to those ordinarily skilled in the art.
This application is related to U.S. patent application Ser. No. 10/853,552 entitled “METHODS AND SYSTEMS FOR CONCEPTUALLY ORGANIZING AND PRESENTING INFORMATION,” by Curtis, et al., filed on May 24, 2004, which is hereby incorporated herein by reference.