As document content becomes increasingly available over wide area networks such as the Internet, indexing and categorizing this content for efficient search becomes more of a challenge for organizations that post content on, for example, web pages. This challenge is likely to become more of an issue as more organizations make information available via electronically searchable databases.
Another challenge with enabling users to electronically search for content is supporting searches for synonyms. Under some approaches, a search engine might receive a given input keyword search, and expand the keywords by identifying synonyms for the keywords at the time that the search is requested. Afterwards, the search engine may perform individual keyword searches for each identified synonym.
While the foregoing approaches may work suitably in some circumstances, there are nevertheless opportunities for improvement, as described further in this application.
Tools and techniques are described for analyzing interactions to identify dissimilar items that may contain synonyms. Methods described herein may retrieve activity records that represent interactions between a visitor and a server-based system, and may identify within the activity records inputs that the visitor provided during the interaction. The methods may identify items within the activity record that are associated with the inputs, and may access additional activity records that also contain the same inputs. The methods may then identify additional items within the additional activity records that are associated with this same input, and may establish similarity ratings for the two items, with the similarity ratings indicating a likelihood that documents respectively associated with the items contain synonyms.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical components or features.
This disclosure is directed to tools and techniques related to detecting synonyms and merging synonyms into search indexes. The description of these tools and techniques begins with an overview of illustrative operating environments for detecting synonyms and merging synonyms into search indexes, presented with
The operating environments 100 may provide at least the server 102 as part of infrastructure that supports one or more resources or sites that are accessible over a network, for example, websites. In some implementations, the website may be a merchant website that offers goods and/or services (collectively and interchangeably referred to as “items”) to customers.
The various goods or services offered by the website may be associated with respective documents. These documents may contain descriptive content that facilitates posting information about the documents or items accessible through the website. For example,
It is also noted that the documents and goods shown in
These documents may be provided to the website or to the server 102 on an ongoing basis, as new products are made available through the website.
Turning to the server 102 in more detail, the server may include one or more processors 122 that communicate with one or more instances of computer-readable storage media 124. The processors may read data from or write data to portions of the computer-readable storage media in performing any of the functions described herein. Additionally, the computer-readable storage media may contain software instructions that, when loaded into the processors, cause the server to perform any of the functions described herein.
The storage media 124 may contain one or more software modules that define a search index construction unit 126, which represents a software-based implementation of suitable instructions for processing the documents 110-114 and generating search indexes therefrom. The storage media 124 may also contain one or more software modules that define a synonym recognition unit 128, which represents a software-based implementation of suitable instructions for recognizing synonyms appearing within the documents, and incorporating those recognized synonyms into the search indexes.
In illustrating the storage media 124,
Having described the operating environments 100 in
The search index file may enable searches conducted across a plurality of documents, such as the documents 110-114. These documents may be represented in the search index file by respective data structures, denoted generally by the graphic elements for the documents 110-114 as shown in
While
As shown in
Turning to the data structures for the documents in more detail,
The fields may be populated by a manufacturer of the goods or services represented by the document 110, or by other parties or processes as appropriate in different implementations. In the baseball glove example, information in the different fields 202 may convey the size of the glove, the color of the glove, the material from which the glove is manufactured, the type of the glove (fielder, catcher, or the like), manufacturer name or identifier, a brand name, SKU or UPC codes, or other parameters of interest. Additionally, a field 202 may provide a textual description or title of the goods or services represented by the document 110.
As will be understood, these examples of different fields are given only for ease of discussion, but not to limit implementations of the description herein. Other fields may be included without departing from the spirit and scope of the description.
The fields 202 may contain data or other information, denoted generally in
Continuing with the example of the baseball glove above, contents of “color” fields may include the text “brown,” “dark beige,” “black,” or the like. Contents of SKU or UPC fields may contain unique binary, numeric, or alphanumeric identifiers for the baseball glove. Contents of a description field may characterize the goods as a “baseball glove” or a “baseball mitt,” for example.
In similar manner, other goods or services (e.g., the example goods 106 and 108) may be represented in data structures corresponding to the documents 112-114. More specifically, these data structures may contain fields and contents similar to those shown at 202 and 204 as discussed above.
The search index file may also include one or more search index terms 206, which serve as key fields or indexes that facilitate searching, for example, the content fields 204. In some instances, whoever provides the documents 110-114 may also specify the fields whose contents are used as search terms.
The foregoing discussion pertains to pre-processing that may be performed to prepare for responding to keyword searches submitted to, for example, a website by visitors to the website. However, the discussion now presents a brief example of a search scenario, before returning to additional description of pre-processing techniques.
Returning to the baseball glove example, assume that a visitor to the website submits a keyword search including the terms “baseball glove.” In responding to this search, the website may submit the search terms “baseball glove” to the search index file 120. In turn, the website (or a server provided as part thereof) may compare the input search terms “baseball glove” to the search index terms 206. Assume that a field 202a is a product description field, that the field contents 204a contain the text “baseball glove,” and that these field contents are presented as a search index term 206a. In this example, the website may match the input search terms “baseball glove” to the text “baseball glove” as it appears in the search index term 206a. In this event, the website may retrieve the document (e.g., 110) that corresponds to the matching search index term, and return this document as a response to the query.
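By way of illustration only, the search scenario above may be sketched in Python as follows. The names `Document`, `build_index`, and `search`, the lower-casing of terms, and the exact-match lookup are assumptions introduced for this sketch; the description does not prescribe any particular data layout or matching rule.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for a document such as 110: fields (202) with contents (204)."""
    doc_id: str
    fields: dict[str, str]
    # Names of the fields whose contents are used as search index terms (206).
    indexed_fields: tuple[str, ...] = ("description",)


def build_index(documents):
    """Map each normalized search index term to the ids of the documents that supply it."""
    index = {}
    for doc in documents:
        for name in doc.indexed_fields:
            term = doc.fields.get(name)
            if term is not None:
                index.setdefault(term.strip().lower(), []).append(doc.doc_id)
    return index


def search(index, query):
    """Return ids of documents whose index term equals the normalized query."""
    return index.get(query.strip().lower(), [])


glove = Document("110", {"description": "baseball glove", "color": "brown"})
index = build_index([glove])
matches = search(index, "Baseball Glove")
```

In this sketch a search for “baseball mitt” would return no documents, which is the gap that the synonym processing described below is intended to close.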
Having described the data structures suitable for implementing a search index file in
The search index construction unit 126 may receive the document 302 for indexing, as indicated by the line 304. The search index construction unit may include a threshold comparison component 306, which compares the fields and/or contents of the input document 302 to the fields and/or contents of a plurality of other documents that have already been indexed into the search index file 120.
The existing documents 110a, 112, and 114 may contain fields and/or contents, similar to those shown in
The threshold comparison component 306 performs a preliminary thresholding or filtering on the existing documents 110a, 112, and 114. More specifically, the threshold comparison component may determine which of these existing documents are sufficiently closely related to the input document 302 that terms appearing within these documents may be synonyms for one another. The threshold comparison component may perform this thresholding or filtering operation by comparing the fields and/or contents of the input document (e.g., 302), in turn, with the fields and/or contents of the existing documents (e.g., 110a, 112, and 114). If the documents being compared contain fields that have similar names, types, contents, or the like, then the documents may pertain to subject matter that is closely related, such that synonyms may appear within the documents.
In this manner, the threshold comparison component may capitalize on the proposition described in the example introduced above involving the input document 302 and the existing document 110a. More specifically, if both documents relate to sporting goods, it is more likely that these documents include similar fields and/or contents. Thus, the threshold comparison component may determine how many fields and/or contents are similar between the input document 302 and the existing document 110a.
To make the foregoing threshold determination, the threshold comparison component may receive a threshold parameter as input, denoted generally at 308. This threshold 308 indicates how similar the fields and/or contents of the input document 302 and the existing document 110a are to be, before these two documents are related enough to contain likely synonyms. Put differently, the threshold 308 specifies how similar the fields and/or contents of the existing documents (e.g., 110a, 112, and 114) are to those of the input document (e.g., 302) for the existing document to survive the filtering or thresholding process.
In possible implementations, the threshold 308 may be specified as a percentage, expressing how closely the two documents being compared relate to one another. For example, a threshold value of 75% may indicate that approximately 75% of the fields and/or content within the input document 302 match or are similar to fields and/or content within the existing documents 110a, 112, or 114. For example, the input document 302 and one or more of the existing documents 110a, 112, or 114 may all contain fields that list the colors, manufacturers, brands, types, SKUs/UPCs, or other parameters of the corresponding items. This scenario provides but one example of matching or similar fields between or among the fields of the various documents.
In an example of dissimilarity, one document might contain a field for a relatively esoteric parameter pertaining to a given item. However, the other documents may not contain corresponding fields for this esoteric parameter. This scenario provides but one example of dissimilarity between or among the fields of the various documents.
In another example, the contents of these fields as contained in different documents may be similar or dissimilar. As an example of content similarity, two documents may contain a color field that contains the textual contents “brown.” As an example of content dissimilarity, the respective color fields might contain the textual contents “brown” and “dark beige.”
In some instances, the thresholding process may consider the categories of the items 104n, 106, and 108, as compared to the category of the new item 104a. Those documents corresponding to items that are in the same or similar categories as the new item 104a (e.g., 104n) may be more likely to pass the thresholding process shown in
The threshold 308 may be set by trial or experimentation, whether by human personnel or by automated processes. Raising the threshold may result in fewer existing documents passing the threshold and being output at 310. Those output documents 310 may be more likely to include synonyms with the input document 302. Lowering the threshold may result in more existing documents passing the threshold and being output at 310, but these documents may include more “false positives,” i.e., terms that may appear to be synonyms, but actually are not synonyms.
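As a minimal sketch of the preliminary thresholding described above, the following Python example scores each existing document by the fraction of the input document's fields that it matches, and keeps those documents that meet the threshold. The function names, the use of exact equality as the test of "similar" contents, and the sample field values are assumptions made only for illustration.

```python
def field_similarity(input_fields, existing_fields):
    """Fraction of the input document's fields found, with equal contents, in the existing document."""
    if not input_fields:
        return 0.0
    matching = sum(
        1 for name, value in input_fields.items()
        if existing_fields.get(name) == value
    )
    return matching / len(input_fields)


def filter_by_threshold(input_fields, existing_docs, threshold):
    """Return ids of existing documents that meet the threshold (cf. threshold 308, output 310)."""
    return [
        doc_id for doc_id, fields in existing_docs.items()
        if field_similarity(input_fields, fields) >= threshold
    ]


# Hypothetical sample data for an input document and three existing documents.
input_doc = {"brand": "Acme", "type": "fielder", "material": "leather", "color": "brown"}
existing = {
    "110a": {"brand": "Acme", "type": "fielder", "material": "leather", "color": "dark beige"},
    "112": {"brand": "Acme", "type": "cleat", "material": "synthetic", "color": "black"},
    "114": {"brand": "Other", "type": "racket", "material": "graphite", "color": "red"},
}

strict = filter_by_threshold(input_doc, existing, 0.75)   # only "110a" passes
lenient = filter_by_threshold(input_doc, existing, 0.25)  # "110a" and "112" pass
```

Lowering the threshold from 75% to 25% admits document 112 as well, which illustrates the trade-off between fewer surviving documents and more potential false positives.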
The thresholding process shown in
Having described components and data flows related to indexing the input document into the search index file with
Block 402 represents receiving an input document (e.g., 302) for indexing into a search index file (e.g., 120). In some implementations, block 402 may include receiving input documents that relate to content posted at a website. In other implementations, the input document may relate to goods and/or services offered through a merchant website, as represented in block 404.
Block 406 represents thresholding an existing document, which is already indexed into the search index file, against the input document. Block 406 may include comparing fields and/or contents of the input document to the fields and/or contents of the existing document, as represented by block 408.
Block 410 represents evaluating whether the existing document is sufficiently similar to the input document that terms appearing in the two documents might be synonyms for one another. Block 410 may include performing a preliminary thresholding or filtering process, examples of which are described above with the threshold comparison component 306.
Continuing with decision block 410, if the existing document passes the threshold evaluation, then the process 400 may proceed to block 412, which represents outputting or identifying the existing document as passing the threshold for similarity to the input document.
From block 410, if the existing document does not pass the threshold evaluation, then the process 400 may proceed to decision block 414. The process 400 may also reach decision block 414 after performing block 412. Decision block 414 represents testing whether the search index file contains any more existing documents to be thresholded against the input document. If not, the process 400 may proceed to an end state 418. However, if the search index file contains more existing documents, then the process 400 may proceed to block 416, which represents selecting a next existing document in the search index file for thresholding against the input document. Afterwards, the process 400 returns to block 406 to repeat the process with the newly-selected existing document.
Having described the process 400 for thresholding the input document while indexing it into the search index file with
As shown in
The synonym recognition unit 128 then processes the fields and/or contents of the input document 302 against the existing documents that survived the preliminary thresholding process shown in
Because the documents 310 survived the preliminary thresholding or filtering process shown in
In more detail, the synonym recognition unit 128 may identify candidate synonyms in the input document 302 by comparing fields and/or contents of that document to the fields and/or contents of the surviving documents 310. More specifically, the synonym recognition unit 128 may identify those portions of the surviving documents 310 that contain terms or phrases that are largely similar to terms or phrases that appear in the input document.
In some cases, there may be differences between terms or phrases appearing in analogous places in the input document and the surviving documents. For example, a field 202a in the input document 302 may be a description field that identifies the goods to which the input document relates. The contents 204a of this field 202a may include the text “baseball glove.” Turning to the surviving documents 310, a field 202x in the surviving document 310x may also be a description field, with the related contents field 204x including the text “baseball mitt.” In this example, the phrases “baseball glove” and “baseball mitt” exhibit some aspects of similarity and some aspects of dissimilarity. More specifically, the term “baseball” appears on both descriptions; however, the terms “glove” and “mitt” differ. As detailed further below, the synonym recognition unit 128 may infer that the terms “glove” and “mitt” are synonyms for one another in the context of sporting goods.
Generalizing from the above example, the synonym recognition unit 128 may recognize how much similarity and dissimilarity exists between the contents appearing in the input document and in a given surviving document. If some level of similarity exists between textual matter appearing in the two documents, then any dissimilar text may be synonyms. The synonym recognition unit may output any such dissimilar portions of the textual matter, denoted generally as candidate synonyms 502. These candidate synonyms 502 may be processed into a data store, such as the search index file 120.
The synonym recognition unit 128 may employ a threshold 504 to specify how much of the textual matter appearing in the two documents is to be similar, before inferring that the dissimilar textual matter might be synonyms. Like the threshold 308 shown in
Having described components and data flows 500 for identifying candidate synonyms appearing in an input document with
In the example shown in
Turning to the existing documents 310x-z, and recalling previous discussion, the surviving or existing documents 310 may include fields 202x-y and contents 204x-y. The synonym recognition unit 128 may receive these fields 202x-y and contents 204x-y as input, denoted generally at 604.
The synonym recognition unit 128 may execute a process 606 that compares the input 602 to the input 604, and identifies aspects of the input 602 that are dissimilar to corresponding aspects of the input 604. If enough of the fields 202 of the input document 302 and the surviving documents 310 have similar contents, then those fields that do not have similar contents might contain synonyms.
As an example of the foregoing, assume that a plurality of fields 202a and contents 204a in the input document 302 contain similar information as a corresponding plurality of fields 202x and contents 204x in the existing document 310x. However, assume that the field 202n in the input document corresponds to the field 202y in the existing document 310x, but that the related contents 204n are different than the related contents 204y. For example, the fields 202n and 202y may be color fields, and the contents 204n and 204y may specify the color of the goods to which the documents 302 and 310x apply (e.g., a baseball glove, shoe, or the like). The contents 204n may include the text “brown”, while the contents 204y may include the text “dark beige.”
If the rest of the fields of the documents 302 and 310x are sufficiently similar to one another, then the process 606 may infer that these two documents relate to similar goods. For example, assume that the documents 302 and 310x each include four fields that are common between the two documents (e.g., a brand field, a manufacturer field, an item description field, and a SKU/UPC field), and that contain similar or identical contents. Assume further that the documents 302 and 310x both contain a fifth field that is also common between the two documents (e.g., a color field), but contains dissimilar contents (e.g., “brown” versus “dark beige”). In light of the preponderance of similar fields and field contents between the two documents 302 and 310x, the process 606 may infer that these two documents relate to similar goods. It is noted that any percentage of similar fields may be suitable in different implementations, depending on experimentation, iteration, and past or projected results. Thus, the foregoing scenario is provided only for example, but does not limit possible implementations.
Having made this inference, the process 606 may also infer that the remaining, dissimilar fields contain candidate synonyms. Returning to the example of differing colors, if enough fields are similar between two documents, then the process 606 may infer that the two documents relate to similar (perhaps identical) goods. Thus, the process 606 may infer that the colors specified in those two documents, while dissimilar, are nevertheless synonyms for one another. Thus, the process 606 may infer that “brown” and “dark beige” are synonyms, and may report these colors as candidate synonyms 608.
In the example given in
The synonym recognition unit 128 or, more specifically, the process 606 may be responsive to the threshold signal 504. This threshold signal may indicate how many of the fields are to be similar before inferring that any differing fields are candidate synonyms.
The candidate synonyms 608 may be processed into a data store, such as the search index file 120 described above. As described further below, these candidate synonyms may enable optimized searching operations.
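The field-level inference described above may be sketched as follows, purely as an illustration. The function name `candidate_field_synonyms`, the use of exact equality to decide whether contents are similar, and the sample field values are assumptions of this sketch; the threshold argument plays the role of the threshold signal 504.

```python
def candidate_field_synonyms(input_fields, existing_fields, threshold):
    """Return (input contents, existing contents) pairs for common fields that differ,
    provided that enough of the common fields have equal contents."""
    common = [name for name in input_fields if name in existing_fields]
    if not common:
        return []
    similar = {name for name in common if input_fields[name] == existing_fields[name]}
    if len(similar) / len(common) < threshold:
        return []  # too little similarity to support the inference
    return [
        (input_fields[name], existing_fields[name])
        for name in common
        if name not in similar
    ]


# Hypothetical contents: four common fields agree and the color field differs.
doc_302 = {
    "brand": "Acme", "manufacturer": "Acme Corp",
    "description": "baseball glove", "sku": "12345", "color": "brown",
}
doc_310x = {
    "brand": "Acme", "manufacturer": "Acme Corp",
    "description": "baseball glove", "sku": "12345", "color": "dark beige",
}

pairs = candidate_field_synonyms(doc_302, doc_310x, threshold=0.75)
```

With four of five common fields matching (80%), a 75% threshold is met and the differing color contents are reported as candidate synonyms; a 90% threshold would suppress them.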
Having described components and data flows 600 for identifying candidate synonyms in a field of the input document with
Block 702 represents comparing an input document (e.g., 302) to a given output document that has survived a preliminary thresholding process, such as the thresholding described in
Block 702 may include comparing respective fields of the input document and the output document, as represented by block 704. Examples of such respective fields that may be compared are shown at 202a and 202x in
Block 706 represents identifying fields and/or contents of fields that are similar between the compared documents. In parallel or in serial with block 706, block 708 represents identifying any fields and/or contents of fields that are dissimilar between the compared documents. Taken together, blocks 706 and 708 may be considered as implementing a second thresholding on the existing document, as represented by block 710. The previous thresholding operation is represented by, for example, block 306 in
Decision block 712 represents evaluating whether the amount of similar content found between the compared documents is sufficient to justify or support an inference that any dissimilar content between the compared documents are candidate synonyms. Block 712 may include using a threshold signal (e.g., 504 in
From decision block 712, if the amount of similar content is not sufficient to justify the inference, then the process 700 may proceed to decision block 716. The process 700 may also reach decision block 716 after performing block 714.
Decision block 716 represents evaluating whether additional documents remain for comparison to the input document. If so, the process 700 may proceed to block 718, which represents selecting another existing document for comparison to the input document. The process 700 then returns to block 702, to repeat the process with the newly-selected existing document.
From decision block 716, if no additional documents remain for comparison to the input document, the process 700 may proceed to end state 720. The process 700 may wait in state 720 for the arrival of another input document for processing.
Having described the process 700 for identifying candidate synonyms in an entire field of the input document with
The synonym recognition unit 128 may include a parser 802, which parses the inputs 602 and 604 into terms that appear within the field content, denoted generally at 804. For example, returning to the “baseball glove”-“baseball mitt” example described above, assume that the contents 204a in the input document 302 contain the text “baseball glove,” and the contents 204x in the existing document 310x contain the text “baseball mitt.” The parser 802 may process the text “baseball glove” from the input document 302 into the individual terms “baseball” and “glove,” and may process the text “baseball mitt” from the existing document 310x into the individual terms “baseball” and “mitt.”
The synonym recognition unit 128 may also include a process 806 that receives as input the parsed terms 804. The process 806 may identify similar and/or dissimilar portions of the individual terms 804, and output those terms that are dissimilar as candidate synonyms. The output candidate synonyms are denoted generally at 808. Returning to the example in which the input 602 includes the text “baseball glove”, and the input 604 includes the text “baseball mitt”, the identification process 806 may correlate the terms “baseball” appearing in both input text strings, but then recognize that the term “mitt” differs from “glove”. In this event, the identification process 806 may output the terms “mitt” and “glove” as candidate synonyms 808.
Having described the components and data flows 800 for identifying synonyms within portions of the fields of the input document 302 with
Block 902 represents receiving contents of fields from an input document (e.g., 302) that is to be indexed into, for example, a search index file (e.g., 120). Block 902 may also include receiving contents of fields from at least one existing document that is already indexed into the search index file (e.g., 310x-z).
Block 904 represents parsing textual contents of the input fields as received in block 902. Block 904 may be performed by a parser (e.g., 802), and may include processing an input textual phrase into its individual constituent terms. For example, block 904 may include parsing the phrase “baseball glove” into the terms “baseball” and “glove.”
Block 906 represents identifying any similar terms appearing in the inputs as received from the input document and the existing document. Block 906 may include comparing the terms as received from a parsing process, and locating any terms that appear in both documents. In parallel or serially with block 906, block 908 represents identifying any dissimilar terms appearing in the documents. Continuing the previous example, block 906 may include identifying the term “baseball” as appearing in both of the phrases “baseball glove” and “baseball mitt,” while block 908 may include identifying the terms “glove” and “mitt” as being dissimilar.
Having identified any similar and/or dissimilar terms appearing in the two input phrases, decision block 910 represents evaluating whether the two input phrases exhibit enough similarity to justify inferring that any dissimilar terms are probably synonyms. For example, returning to the “baseball glove-baseball mitt” example above, these two phrases each contain two terms, with one term (“baseball”) occurring in both phrases. In this particular example, this one common term may be sufficient to justify inferring that the dissimilar terms (“glove” and “mitt”) are candidate synonyms.
It is noted that any percentage of similar terms appearing within phrases may be suitable in different implementations, depending on experimentation, iteration, and past or projected results. Thus, the foregoing scenario is provided only for example, but does not limit possible implementations.
Returning to decision block 910, if the two input phrases exhibit sufficient similarity, the process 900 may proceed to block 912, which represents outputting the dissimilar portions of the input phrases as candidate synonyms (e.g., 808). Afterwards, the process 900 may reach end state 914, to await the next iteration of the process 900.
Returning to decision block 910, if the two input phrases do not exhibit enough similarity to justify inferring that any dissimilar terms are probably synonyms, then the process 900 may proceed directly to end state 914.
The blocks 906, 908, and 910 may be viewed as applying a thresholding operation to the terms that make up the content received in block 902.
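One possible reading of blocks 904 through 912 is sketched below in Python. Whitespace tokenization, lower-casing, measuring similarity as the number of shared terms divided by the length of the longer phrase, and the default threshold of 0.5 are all assumptions chosen for illustration; as noted above, any suitable percentage may be used in different implementations.

```python
def candidate_term_synonyms(input_phrase, existing_phrase, threshold=0.5):
    """Return (terms only in input, terms only in existing) if the phrases are similar enough,
    otherwise None."""
    # Block 904: parse each phrase into its constituent terms.
    input_terms = input_phrase.lower().split()
    existing_terms = existing_phrase.lower().split()

    # Block 906: terms appearing in both phrases.
    shared = [term for term in input_terms if term in existing_terms]
    # Block 908: terms appearing in only one phrase.
    only_input = [term for term in input_terms if term not in existing_terms]
    only_existing = [term for term in existing_terms if term not in input_terms]

    # Decision block 910: is there enough similarity to support the inference?
    longest = max(len(input_terms), len(existing_terms))
    if longest == 0 or len(shared) / longest < threshold:
        return None
    if not only_input or not only_existing:
        return None  # nothing dissimilar on one side, so no candidate pair

    # Block 912: output the dissimilar portions as candidate synonyms.
    return only_input, only_existing


result = candidate_term_synonyms("baseball glove", "baseball mitt")
unrelated = candidate_term_synonyms("baseball glove", "running shoe")
```

For “baseball glove” and “baseball mitt,” one of two terms is shared, which meets the assumed 0.5 threshold, so “glove” and “mitt” are output; the unrelated phrases share no terms and yield no candidates.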
Having described the process 900 for identifying synonyms within portions of the fields of the input document with
One or more visitors 1002 to a network-accessible resource (e.g., a website) may interact with one or more servers or systems 102 that host the resource.
The system 102 may serve as a gateway that enables the visitors to access information about a set of items. This set of items may be referred to as a “catalog” of items. The system 102 may provide different mechanisms by which visitors may extract a relevant subset of these items for detailed review and consideration.
Another example subsetting mechanism is a merchandising user interface (UI) element 1008, which may conduct a dialog with the visitor to establish a particular context of interest to the visitor. For example, if the visitor is searching for a gift for an intended recipient, the merchandising element 1008 may collect information about the recipient, and based on this information, may recommend a set of one or more items as candidate gifts to the visitor. Thus, in the given context of recommending a gift for this recipient, the merchandising element may extract a subset of items that are relevant to this context. As another example, the merchandising element may recommend particular colors for particular items, or may recommend colors for a given item that match or coordinate with one or more other items. In this manner, the merchandising element may serve as an intermediary between the visitors and various components of the system 102 (e.g., the search engine 1006).
The visitors may provide inputs to the subsetting mechanisms 1004. If the visitor is interacting with the search engine 1006, these inputs may include one or more search terms submitted to the search engine. If the visitor is interacting with a merchandising element, these inputs may include specifications or criteria that the visitor provides to the merchandising element. These specifications or criteria may enable the merchandising element to set an appropriate subsetting or searching context for the visitor.
Turning first to the visitor 1002a, this visitor may provide one or more inputs to the system, with these inputs denoted at 1010a. Examples of these inputs may include search terms provided to a search engine, and/or may include specifications or criteria suitable for setting a search context. The visitor may provide these inputs to locate items for potential purchase or acquisition, for example.
The system 102 may receive these inputs 1010a and generate corresponding results 1012a. These results 1012a may include representations of one or more items, with
The visitor 1002a may then review the representations included in the results 1012a. If any of these search results interest the visitor, then he or she may perform some activity related to the search results of interest, with
The system 102 may store one or more transaction activity records 1016 that indicate a sequence of inputs and responses involving the visitor and the system. The system may store these transaction activity records in, for example, an activity records file 1018.
Turning to the visitor 1002n, this visitor may interact with the system 102 similarly to the visitor 1002a. For example, the visitor 1002n may provide inputs 1010n to the system, and receive results 1012n in response. In the example shown in
In the example shown in
The system may also store transaction activity records 1016 related to interactions involving the visitor 1002n, or any other visitors who interact with the system 102. Having described the example transactions involving the visitors in
The activity records file 1018 may receive and store one or more instances of transaction activity records, denoted collectively at 1016.
Turning to the record 1016a as an example, the records 1016 may contain an inputs field 1102 that indicates what inputs the visitor entered to initiate a given transaction. As described above, these inputs may include search terms provided to a search engine, or specifications or criteria provided to a merchandising UI element (e.g., 1008). In the example shown in
The search term field 1102 may be associated with one or more fields, with
A field 1106 may store an indication of an item that the visitor selected and purchased, after having provided the search term stored in the field 1102. A field 1108 may store an indication of how long a visitor browsed information related to a given item, after having provided the search term stored in the field 1102. A field 1110 may store an indication of an item that the visitor placed on a wish list or other similar structure, after having provided the search term stored in the field 1102. A field 1112 may store an indication of any ratings or tagging actions performed by the visitor on a given item.
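As a minimal sketch of how one such activity record might be held in memory, the following Python dataclass mirrors the fields 1102 and 1106-1112 described above. The class name, attribute names, and item identifiers are hypothetical and are chosen only for illustration; the description does not prescribe any particular layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ActivityRecord:
    """Hypothetical in-memory form of one transaction activity record (1016)."""

    inputs: str                                   # field 1102: search terms or criteria
    purchased_item: Optional[str] = None          # field 1106: item selected and purchased
    browse_seconds: Dict[str, float] = field(default_factory=dict)   # field 1108
    wish_list_items: List[str] = field(default_factory=list)         # field 1110
    rated_or_tagged: Dict[str, str] = field(default_factory=dict)    # field 1112

    def items(self) -> List[str]:
        """Return every item touched by this record, in order, without duplicates."""
        candidates = [self.purchased_item, *self.browse_seconds,
                      *self.wish_list_items, *self.rated_or_tagged]
        seen: List[str] = []
        for item in candidates:
            if item is not None and item not in seen:
                seen.append(item)
        return seen


# Not every field needs to be populated in every record.
record = ActivityRecord(inputs="sofa", purchased_item="item-A",
                        browse_seconds={"item-A": 42.0, "item-B": 5.0})
```

Defaulting every activity field to an empty value reflects the point that an implementation may populate some of these fields without populating all of them.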
The fields 1104-1112 provide various examples of the activity 1008a that the visitor 1002a may perform after entering the search term 1004a. However, in providing these examples of such activity, it is noted that other types of activity are possible without departing from the scope and spirit of the description herein. Additionally, implementations of the description herein may populate one or more of the example fields 1104-1112, but need not populate all of these fields in every instance. In the example shown in
Turning to the record 1016n, this record may store information relating to the interaction between the visitor 1002n, and the item 106n selected by the visitor. However, in the interests of clarity,
Having described the fields and contents of the activity records file in
The system 102 is carried forward from previous drawings, and may include one or more processors and one or more instances of computer-readable storage media.
The transaction analysis module 1202 may also define and output a similarity signal 1206. For example, if the transaction analysis module 1202 detects two dissimilar candidate items 1204, then these two candidate items may be associated with respective instances of activity or behavior history that led to the selections of the items by the visitors. The transaction analysis module 1202 may score these instances of activity or behavior to indicate whether the activity positively or negatively correlates the input provided by the visitor to the item selected by the visitor.
As examples of this scoring process, if the visitor provides a search term and afterwards purchases a given item represented in the results, this behavior may strongly correlate the search term with the given item. However, if another visitor provides the same search term, afterwards reviews another item different than the given item, but does not purchase this other item, then this behavior may negatively correlate the search term with the different item. The signal 1206 may indicate this negative or positive correlation, and may modify or augment a similarity rating of items that resulted from textual analysis of documents related to the items (e.g., 110 and 112).
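The scoring described above may be sketched as follows. This is only an illustration under stated assumptions: the activity labels, the numeric weights, and the 0.1 scaling factor are hypothetical, since the description says only that a purchase may correlate strongly and that reviewing without purchasing may correlate negatively.

```python
from typing import Dict, List

# Hypothetical weights: positive values correlate the input with the item,
# negative values weaken that correlation.
ACTIVITY_SCORES: Dict[str, float] = {
    "purchase": 1.0,
    "wish_list": 0.5,
    "view_only": -0.25,
}


def similarity_signal(base_rating: float, activities: List[str]) -> float:
    """Adjust a text-based similarity rating by behavior scores, clamped to [0, 1]."""
    adjustment = sum(ACTIVITY_SCORES.get(name, 0.0) for name in activities)
    # Assumed scaling so that a single action nudges, rather than replaces,
    # the rating that came from textual analysis.
    adjusted = base_rating + 0.1 * adjustment
    return max(0.0, min(1.0, adjusted))
```

Under these assumed weights, a purchase raises the rating that came from textual analysis, while a view without a purchase lowers it slightly.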
In the example shown in
The synonym recognition unit may also receive the signal 1206 that represents a similarity rating of the candidate items. The synonym recognition unit may consider any similarity in the respective activity or behavior histories of the candidate items 1204. For example, if respective visitors purchased two or more candidate items after the visitors performed similar activities, this may indicate that the candidate items are stronger candidates for synonym analysis. However, if the respective visitors purchased two or more candidate items after the visitors performed dissimilar activities, this may indicate that the candidate items are weaker candidates for synonym analysis. As described above, this activity analysis may modify any textual analysis of documents related to the items (e.g., 110 and 112).
Having described the components of a system for processing the activity records stored in an activity records file, the discussion now proceeds to a description of processes for analyzing transactions to identify different items that resulted from the same search terms and that were selected for activity by different visitors, now presented with
Block 1302 represents traversing an activity records file (e.g., 1018), and retrieving from this file one or more activity records.
Block 1304 represents identifying any search terms associated with an activity record currently under analysis.
Block 1306 represents identifying any items that are associated with activity stored in the instant activity record.
Block 1308 represents traversing the activity records file (e.g., 1018) to locate any other activity records that are keyed or indexed by the same search term as the search term identified in block 1304. In this manner, block 1308 may indicate whether any other visitors performed searches using the same search term.
Evaluation block 1310 represents evaluating whether any other activity records in the activity records file contain the same search term identified in block 1304. If block 1310 evaluates to “true” or “yes”, then the process flows 1300 may proceed to block 1312, which represents accessing one or more activity records that contain this same search term. In other words, if the process flows 1300 take the Yes branch from block 1310, this would indicate that two or more items were selected that resulted from the same search terms.
Block 1314 represents identifying any items associated with the activity records accessed in block 1312. Block 1314 may be similar to block 1306, but is performed on the one or more activity records that match the original activity record processed in block 1304.
Block 1316 represents evaluating whether the item identified in block 1306 is different than the item identified in block 1314. If these items are different, then two or more different items were selected by visitors after the visitors searched using the same terms. In this case, the process flows 1300 may take a Yes branch to block 1318.
Block 1318 represents establishing a similarity rating of the two items.
Block 1320 represents adjusting the similarity ratings of the items based on the type of activity that the visitors performed in selecting the items. Different types of activity may have different levels of importance in assessing similarity of items. For example, block 1320 may accord more importance to activity that culminates in actual purchases of the items. Block 1320 may accord less importance to activity that culminates in browsing or viewing the items, but not actual purchases. Block 1320 may include providing indications of these candidate items and related similarity ratings to a synonym recognition unit (e.g., 128). In turn, the synonym recognition unit may factor in the similarity rating when processing the documents.
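One possible way to realize blocks 1302-1320 is sketched below. It groups activity records by search term, pairs up different items reached through the same term, and weights each pair by activity type. The record layout, the weight values, and the choice of taking the weaker of the two activities as the pair's contribution are all assumptions made for illustration, not details taken from the description.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

# Each record is (search_term, item, activity). The weights are hypothetical.
Record = Tuple[str, str, str]
ACTIVITY_WEIGHT: Dict[str, float] = {"purchase": 1.0, "wish_list": 0.6, "browse": 0.3}


def rate_candidate_pairs(records: List[Record]) -> Dict[Tuple[str, str], float]:
    """Rate pairs of different items that were reached via the same search term."""
    by_term: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for term, item, activity in records:            # blocks 1302-1306
        by_term[term].append((item, activity))

    ratings: Dict[Tuple[str, str], float] = defaultdict(float)
    for hits in by_term.values():                    # blocks 1308-1314
        for (item_a, act_a), (item_b, act_b) in combinations(hits, 2):
            if item_a == item_b:                     # block 1316: same item, skip
                continue
            pair = tuple(sorted((item_a, item_b)))
            # Blocks 1318-1320: the pair's rating grows by the weaker of the
            # two activity weights, so purchases count more than browsing.
            ratings[pair] += min(ACTIVITY_WEIGHT.get(act_a, 0.0),
                                 ACTIVITY_WEIGHT.get(act_b, 0.0))
    return dict(ratings)


records = [
    ("couch", "sofa-1", "purchase"),
    ("couch", "couch-9", "purchase"),
    ("couch", "couch-9", "browse"),
    ("lamp", "lamp-2", "browse"),
]
ratings = rate_candidate_pairs(records)
```

Grouping by search term first avoids re-traversing the whole file for each record, which is a design choice of this sketch rather than a requirement of the process flows 1300.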
Block 1322 represents selecting a next activity record in the activity records file for analysis. Afterwards, the process flows 1300 may return to block 1304, and repeat blocks 1304-1322 with this next activity record. As described above,
Referring to evaluation blocks 1310 and 1316, if the result of either of these blocks is negative, then the process flows 1300 may advance from either of those blocks to before block 1322, as shown in
The tools and techniques shown in
Having described the tools and techniques shown in
The input document 302 to be indexed into the search index file 120 may include at least one field 1002 that is associated with content 1004. Using any of the techniques described previously, the content 1004 may be recognized as containing one or more candidate synonyms. More specifically, the content 1004 may contain synonyms with terms or phrases occurring within the documents 310x and 310y, which are already indexed into the search index file. As such, the search index file may have extracted search terms 1006x and 1006y, respectively, from these previously-indexed documents 310x and 310y, as represented by the dashed lines 1008x and 1008y.
In this scenario, assume that one or more terms or phrases occurring in the contents 1004 are candidate synonyms with terms or phrases that occur in the existing document 310x. These terms or phrases from the input document 302 may be extracted for use as search terms 1008, as represented by the dashed line 1010. However, because these terms or phrases are also candidate synonyms for terms in the document 310x, these terms or phrases may be merged with those terms in the document 310x, as represented by the dashed line 1012. As detailed with
Having described the components and data flows 1000 related to merging detected candidate synonyms into the search index file with
Block 1102 represents indexing an input document (e.g., 302) into a search index file (e.g., 120). Block 1102 may include extracting search terms (e.g., 1008) from certain contents of the input document.
Block 1104 represents evaluating whether any synonyms have been found in the input document. For example, block 1104 may include evaluating whether any candidate synonyms (e.g., 502, 608, 808) have been reported for the input document. If so, the process 1100 may proceed to block 1106, which represents logically merging the synonyms in the input document (e.g., 1004) with any matching synonyms in one or more of the existing documents, such that a subsequent keyword search specifying one of the synonyms will also return all of the merged synonyms. More specifically, block 1106 may include linking synonymous search terms (e.g., 1012) within a data structure that stores the search terms, as represented by block 1108. The data structure may accomplish this linkage using any convenient mechanism that logically connects the synonyms appearing in different documents, e.g., pointers, handles, or other constructs. The search index file described herein (e.g., 120) is but one possible example of such a data structure.
Returning to decision block 1104, if no synonyms were recognized in the input document, then the process 1100 may proceed to block 1110, which represents awaiting the next input document for indexing into the search index file.
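The linkage described for blocks 1106 and 1108 may be illustrated with a toy index in which synonymous terms share a single set of documents. The class and method names below are hypothetical, and integer group identifiers stand in for the pointers or handles mentioned above; this is a sketch of one possible data structure, not the search index file 120 itself.

```python
from typing import Dict, Set


class SynonymIndex:
    """Toy search index in which merged synonyms share one document set."""

    def __init__(self) -> None:
        self._group_of: Dict[str, int] = {}    # term -> group id (the "handle")
        self._terms: Dict[int, Set[str]] = {}  # group id -> linked terms
        self._docs: Dict[int, Set[str]] = {}   # group id -> document ids
        self._next_id = 0

    def add(self, term: str, doc_id: str) -> None:
        """Index a document under a search term (block 1102)."""
        group = self._group_of.get(term)
        if group is None:
            group = self._next_id
            self._next_id += 1
            self._group_of[term] = group
            self._terms[group] = {term}
            self._docs[group] = set()
        self._docs[group].add(doc_id)

    def merge(self, term_a: str, term_b: str) -> None:
        """Link two already-indexed synonyms (blocks 1106-1108)."""
        keep, drop = self._group_of[term_a], self._group_of[term_b]
        if keep == drop:
            return
        for term in self._terms.pop(drop):
            self._group_of[term] = keep
            self._terms[keep].add(term)
        self._docs[keep] |= self._docs.pop(drop)

    def search(self, term: str) -> Set[str]:
        """Return documents for the term and all of its merged synonyms."""
        group = self._group_of.get(term)
        return set() if group is None else set(self._docs[group])


index = SynonymIndex()
index.add("couch", "310x")   # previously indexed document
index.add("sofa", "302")     # input document
index.merge("sofa", "couch")
```

After the merge, a lookup on either term returns both documents from a single group lookup. Note that `merge` in this sketch assumes both terms are already indexed and raises `KeyError` otherwise.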
Having described the components and data flows 1100 related to merging detected candidate synonyms into the search index file with
The visitors may search for resources on the server-based system by submitting keywords 1206.
In some instances, the server-based system 1204 may be the same as the system 102, which is shown in
In the scenario shown in
Assume, however, that the input keywords 1206 match with the search term 1008. Recall that during the pre-search processing shown in
In this case, the search submission unit 1208 may submit only one search request 1210 for the input keywords 1206, but may still obtain search results 1212 that include any synonyms for the input keywords 1206. However, the search submission unit 1208 accomplishes this result without spawning and executing multiple search requests at search-time. By detecting synonyms during the preprocessing phase, and merging the synonyms in the search index file before search-time, the search submission unit 1208 effectively merges or combines searches across known synonyms ahead of time. In this manner, the search submission unit 1208 may avoid the overhead and search-time delays involved with performing multiple search requests.
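The difference between expanding synonyms at search time and merging them ahead of time may be sketched as follows. The postings, the synonym table, and the function names are hypothetical, and the request count is simply the number of index lookups each approach performs in this toy model.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical data: per-term postings and a synonym table.
POSTINGS: Dict[str, Set[str]] = {"couch": {"310x"}, "sofa": {"302"}, "lamp": {"400"}}
SYNONYMS: Dict[str, List[str]] = {"couch": ["sofa"], "sofa": ["couch"]}


def expand_at_search_time(keyword: str) -> Tuple[Set[str], int]:
    """Baseline: one request for the keyword plus one per synonym."""
    terms = [keyword, *SYNONYMS.get(keyword, [])]
    results: Set[str] = set()
    for term in terms:
        results |= POSTINGS.get(term, set())
    return results, len(terms)


def build_merged_postings() -> Dict[str, Set[str]]:
    """Pre-search step: fold each synonym's postings into the term's own entry."""
    merged: Dict[str, Set[str]] = {}
    for term, docs in POSTINGS.items():
        merged[term] = set(docs)
        for synonym in SYNONYMS.get(term, []):
            merged[term] |= POSTINGS.get(synonym, set())
    return merged


MERGED = build_merged_postings()


def search_merged(keyword: str) -> Tuple[Set[str], int]:
    """Single request against the pre-merged index."""
    return set(MERGED.get(keyword, set())), 1
```

Both functions return the same documents for a keyword that has synonyms, but the pre-merged lookup issues a single request regardless of how many synonyms are known.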
Having described the operating environments 1200 with
The operating environment 1300 may include a server-based system 1302, which may be similar to the server-based system 1204 shown in
As described in
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claims.
This application is a continuation-in-part (CIP) of pending U.S. application Ser. No. 11/617,131, filed on 28 Dec. 2006, entitled “Detecting Synonyms and Merging Synonyms into Search Indexes.” The contents of this parent application are incorporated by this reference as if set forth verbatim herein, and the benefit of the filing date of this parent application is hereby claimed to the fullest extent permitted by 35 U.S.C. §120.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 11617131 | Dec 2006 | US |
Child | 11694721 | | US |