In most cases, web sites configure their entry pages to be broadly attractive to a large number of users and provide ways for the customer to navigate to their desired item by “drilling down” through sections of the site by browsing or by entering keyword or item number queries. There are two main disadvantages to this. First, a visitor to a site may be unaware that it actually carries the product that they are interested in and so does not bother to search. Second, keyword searches and drill-down navigation can be an inefficient process that causes potential customers to give up before they find the product on the web site.
Some web sites display “recommended” items. Typically, these recommended items are the same for all visitors and simply reflect products that are currently popular or are complementary to an item being viewed. In some cases, the recommendations may be targeted to specific users. Such targeting can be based on observations of the user's behavior when interacting with the site (e.g., products viewed, products purchased, and/or product ratings), possibly in combination with the behavior of other users who have looked at, purchased, and similarly rated the same (or similar) items.
Embodiments of the invention relating to both structure and method of operation may best be understood by referring to the following description and accompanying drawings.
Web sites often serve content relating to items or entities that a user may be interested in. In this specification, such items or entities will be referred to as “potentially interesting items” (“PIIs”). Examples of potentially interesting items include, but are not limited to: products or services offered for sale, discussed, reviewed, or supported; off-line businesses or other entities discussed, linked to, or described; organizations such as charities, colleges and universities, or mailing lists, that may be joined or contributed to; athletic teams for which scores and schedules may be provided; geographic locales for which vacation, real estate, weather, or governmental information may be provided; subjects about which descriptions or news articles may be provided; and content objects such as specific articles, reports, manuals, programs, data sets, photographs, video files, or audio files. Within this specification, reference to any of these should be construed as applying to all.
Embodiments of systems and methods disclosed herein enable Web sites to customize their recommendations to users based not merely on a user's browsing behavior on that site but also on the user's behavior on other sites that talk about or offer similar potentially interesting items. Review sites (or other news sites) can push reviews, stories, or ads that are relevant for shopping for products or services the user has shown an interest in. User profiles are developed on a system that is not affiliated with a particular web site. Accordingly, web sites can receive information regarding a user's interests without any direct collaboration with the other sites the user interacts with. Users can remain anonymous to the site being customized. Interest in specific PIIs, generic PIIs, and PII categories can be inferred from a user's previous activity on the same or other websites. Extra network traffic is not created to infer the potentially interesting items, and the inference only minimally impacts the user's system performance.
Referring to
Customization logic 112 can be implemented by a third party customization service provider and can include logic that can be executed by processing unit 106 to build target sets 114 corresponding to potentially interesting items, observe a user's browsing behavior and note when targets in the target set 114 are detected, and obtain targets relevant to a particular user and use the targets to customize a web site when the user invokes the web site. Targets can be associated with potentially interesting items, and can be text strings, bigrams, numbers, or other suitable information. There may be many targets associated with the same PII, and there may be more than one PII associated with a target. The term “content object” can refer to any form of digital content, such as news articles, photographs, movies, or product reviews, regardless of format (e.g. text, Microsoft Rich Text Format, Adobe PDF, Apple QuickTime movie, or Adobe Flash). A content provider is either a web site, or a third party that is used by a web site to fill in content in the final web page presented to the user, such as an external ad agency. No restriction to the claims is implied by these examples.
Workstation(s) 102 can include browser program 116 to allow users to communicate with processing unit 106 and various web site servers 118 over network 104. Browser program 116 can also provide a graphical user interface that is presented on display device to allow a user to interact with and view information from web site servers 118. Examples of suitable browser programs 116 are Internet Explorer from Microsoft Corporation (www.microsoft.com) and Firefox from Mozilla Corporation (www.mozilla.com), among others.
Web site servers 118 can be accessed by workstations 102 and processing unit(s) 106 via network 104. Web site servers 118 can host and implement electronic commerce sites that allow a user to view web pages 122 that render information on goods and/or services for sale to be displayed to a user. Web pages 122 may also allow the user to view additional detail, and order and pay for selected goods/services. Web site servers 118 can also host manufacturer/supplier web sites, independent third party review or catalog web sites, or other type of web site that provides information regarding various potentially interesting items available. Accordingly, web site servers 118 can maintain and provide a list of character strings 120 that uniquely identify the PIIs for which information is available.
The content and layout of the web pages 122 can be specified with a mark-up language, such as Hyper-text Mark-up Language (HTML). Based on the subset of matches between text in PII text strings 120 for a particular web page 122 and target sets 114, as well as text matches found as the user visits other web sites, a profile 124 of the matches can be created for the user by customization logic 112 taking into account time elapsed since a particular match was detected as well as the number of times a particular match is noted, as further described herein. Notably, customization logic 112 can accumulate information from a variety of web pages 122 to generate user profiles 124. A text string can be at least a portion of a text source such as at least one of: a title of web content, a markup language element of the web content, a header associated with a request for the web content, a keyword associated with the web content, text to be presented upon viewing the web content, and text to be presented in a contrastive manner upon viewing the web content. The term “contrastive manner” refers to text that is presented in a manner that is different from the rest of the text, such as text that is in bold font, italics, blinking, in a different color, highlighted, or any other suitable format for drawing attention to the text or presenting the text in a manner that is different from other text.
Although profiles 124 are shown in processing unit 106, profiles 124 can reside on workstation 102, processing unit 106, and/or a remote database (not shown) and be accessed via network 104. Customization logic 112 and target sets 114 are provided by a party that is unaffiliated with the parties that provide the web pages 122 hosted by web site servers 118. This allows target sets 114 and profiles 124 to include information from many web sites that are unaffiliated with one another and are provided by different parties instead of just one or more web sites that are affiliated with or provided by a particular party. A party, also referred to as a content provider, may be an individual, an organization, or other entity that provides web content presented to users via web site servers. In this disclosure, web sites are considered unaffiliated with one another when the web sites do not share content with one another, and information presented to a user and/or received from a user on one web site is not shared with the other web site(s). Further, unaffiliated web sites may be operated by different ownership entities, for example, GOOGLE®, YAHOO!®, and CRAIGSLIST®.
Building the Target Set
Referring to
Process 202 can include obtaining one or more strings per product. The strings can be provided directly by the manufacturer/supplier and are often used by sellers' web sites. An example of a product string is “HP SL4778N 47-Inch 1080p MediaSmart LCD HDTV”. Such product strings may be solicited from suppliers, online stores, and/or may be provided as part of a partnership agreement between the web site and an entity providing customization service. The strings from sellers' web sites may also be augmented by strings from other sources such as encyclopedic web sites like IMDB (for movie titles), and lists of products from manufacturers' web sites, among others. The descriptive string may be a “proper name” of the product that is used to construct the title and/or headers for the web site's page dealing with the product.
In some cases, the product strings used in a particular web site may be found by “scanning” the web site's pages and noting the titles or other potentially important text. In such cases, it may be useful to process the titles, headings, or page text to exclude text that is essentially invariant over many pages. Such text may often be boilerplate such as the name of a store, navigation menus, or a “department” within the site. Such text may also include hosted advertising, links to other products, or customer- or user-supplied comments. Such exclusion may be made in many different ways. In some embodiments, exclusion may be based on models learned via machine learning techniques or rules generated by people. The exclusions may be made based on the content of the included text or based on its position within the text. For sufficiently important sites, it may be worth manually constructing XPATH expressions or other forms of automatic rules for processing pages to accurately extract the names of products.
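The exclusion of text that is essentially invariant over many pages can be illustrated with a minimal sketch. The function name strip_invariant_words and the 80% cutoff are assumptions chosen for illustration; a learned model or hand-written rules, as mentioned above, could take their place.

```python
from collections import Counter

def strip_invariant_words(titles, max_share=0.8):
    """Drop words that occur in more than max_share of the scanned titles.

    Hypothetical helper: words shared by nearly every page (store name,
    separators, department labels) are treated as boilerplate.
    """
    if len(titles) < 2:
        return list(titles)  # too few pages to judge what is invariant
    tokenized = [title.split() for title in titles]
    # Count each word once per title so repeats inside a title do not skew it.
    counts = Counter(word for words in tokenized for word in set(words))
    limit = max_share * len(tokenized)
    return [
        " ".join(word for word in words if counts[word] <= limit)
        for words in tokenized
    ]
```

With titles such as "Example Store - HP SL4778N" and "Example Store - Acme Toaster" scanned from the same site, the shared prefix is removed and only the product-specific words remain.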
In some cases, the list of targets can include product names. In other embodiments, the product names may be more complex and annotated with other information such as an identifier (e.g., ISBN, SKU, or URL) that specifically identifies the product to a particular web site. Other information may include the type of product, the manufacturer, the price range (e.g., “high end”, “budget”), and an indicator of a generic product, among others. Such information may be hierarchical. For example, a product string may be increasingly specific: “electronics”, “home entertainment”, “TV”, “plasma”, “39-47 inch” and “HP SL4778N”. The hierarchy may be provided by the web site (and, therefore, specific to the web site) or may be provided by the customization service provider and shared among web sites. In the latter case, it may be necessary to obtain a mapping between the web site's class hierarchy and the customization service provider's hierarchy.
Process 204 can process the list of text strings to make it more likely that products will be noticed when a user's web browsing is monitored. First, the strings will likely be normalized, e.g., by removing punctuation and whitespace, converting letters to lowercase, mapping HTML entities (e.g., “&amp;” mapped to “&”), and converting accented characters to canonical form (or mapping them to unaccented characters). Text deemed “noise” (e.g., parentheses or brackets) may be removed or separate entries may be created with and without noise text. “Stopwords” such as “the”, “a”, and “of” may be removed. American/British spelling variants (e.g. “colour” for “color”) and known-common misspellings may be inserted. Other normalization techniques can be used in addition to or instead of the aforementioned techniques.
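The normalization steps of process 204 can be sketched as follows. This is a minimal illustration, not the described implementation: the function name normalize, the small stopword set, and the single spelling variant are assumptions, and the sketch maps a variant to one canonical spelling rather than inserting additional entries.

```python
import html
import re
import unicodedata

STOPWORDS = {"the", "a", "of"}           # the stopwords named in the text
SPELLING_VARIANTS = {"colour": "color"}  # illustrative entry only

def normalize(raw: str) -> list[str]:
    """Return the normalized words of a product string."""
    text = html.unescape(raw)                   # map HTML entities to characters
    text = unicodedata.normalize("NFKD", text)  # split accents from base letters
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation becomes whitespace
    words = [SPELLING_VARIANTS.get(w, w) for w in text.split()]
    return [w for w in words if w not in STOPWORDS]
```

Replacing punctuation with a space rather than deleting it keeps "47-Inch" as two words, which is one possible design choice; removing the hyphen outright would instead yield a single token.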
Process 206 can include extracting substrings of interest from the strings by using one or more suitable techniques such as running a multi-word window over the strings, comparing strings for substrings in common from the same web site as well as with strings from other web sites, and/or other suitable techniques. With respect to common substrings, customization logic 112 can determine an “edit distance” that detects similarity between two strings, allowing for the insertion, deletion, transposition, and/or replacement of words to allow variations in naming the same product between store web sites. Customization logic 112 can also attempt to determine or infer whether parts of a string represent descriptive attributes like size and color rather than the product name. If so, such parts can be removed or moved to another part of the string. Color determination can be made, for example, by noting that the same text (e.g., “(red)”) appears on unrelated products or by consulting a dictionary of such targets.
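A minimal sketch of two of the techniques in process 206 follows: a multi-word window and a word-level edit distance that allows insertion, deletion, replacement, and transposition of adjacent words. The function names are hypothetical, and the distance shown is the common "optimal string alignment" variant, chosen here as an assumption.

```python
def word_windows(words, size):
    """All contiguous runs of `size` words, in order of position."""
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]

def word_edit_distance(a, b):
    """Edit distance between two word lists, counting each inserted,
    deleted, replaced, or adjacent-transposed word as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a word
                          d[i][j - 1] + 1,         # insert a word
                          d[i - 1][j - 1] + cost)  # replace (or keep) a word
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap neighbors
    return d[-1][-1]
```

Two store listings whose names differ only in the order of two adjacent words are therefore at distance one, which could be treated as naming the same product.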
Once a list of target strings (the target set 114) has been determined for a particular user, process 208 can include creating a representation of the target set 114 for use when monitoring the user's behavior. In some embodiments, the target set 114 may be stored directly in a database. In other embodiments, performance can be improved by creating a compact data structure that can be transmitted to a user's workstation to efficiently determine strings that are in the target set 114 as the user is browsing web sites. To create a compact data structure, process 208 can include computing a hash of each of the strings. For example, process 208 can compute the hash (or “hash code”) of each word and then the hash of each subset of contiguous words, as further described in the section “Noting Products Viewed” hereinbelow. One suitable hash technique is described in U.S. patent application Ser. No. ______ (Attorney Docket No 200802054-1) entitled “Systems and Methods for Fast Text Feature Extraction For Classification and Indexing”. Note that some or all of the process of normalizing the strings can take place while the hash code is being computed. Further, other suitable techniques for compacting the data can be used. A match between the text strings and the target strings can be detected using a moving window on the text of the web page content and comparing the hash codes against those of the target strings. For example, if a moving window of three words were used, a candidate hash code would be generated and compared based on the first through third words, then a second candidate hash code would be generated based on the second through fourth words, and so on. The term “moving window” refers to a limited region of the text of the page that is tested for a match. This window is iteratively moved to successive (possibly but not necessarily overlapping) positions in the text, testing for a match at each position.
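The three-word moving window described above can be sketched as follows. The hash function is a stand-in: BLAKE2b from the Python standard library is used only for illustration and is not the technique of the referenced application, and the names phrase_hash, build_target_hashes, and window_matches are hypothetical.

```python
import hashlib

def phrase_hash(words):
    """64-bit hash code of a run of words (illustrative stand-in hash)."""
    data = " ".join(words).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def build_target_hashes(targets):
    """Compact representation of the target set: hash codes only."""
    return {phrase_hash(target) for target in targets}

def window_matches(page_words, target_hashes, size=3):
    """Slide a window of `size` words over the page text and report each
    position whose hash code appears among the target hash codes."""
    hits = []
    for start in range(len(page_words) - size + 1):
        window = page_words[start:start + size]
        if phrase_hash(window) in target_hashes:
            hits.append((start, tuple(window)))
    return hits
```

Because only hash codes are kept, the workstation never needs the target strings themselves; the cost is a small chance that an unrelated window produces the same hash code.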
Noting Products Viewed
Referring to
Process 302 monitors a user's interactions with web sites and the content of web pages accessed by the user, and identifies web page code of interest, such as names and/or other identifiers of products or services. For example, when the source text is HTML code, process 302 can detect a name/identifier of a product or service present in a “title” tag in HTML code, which is text that is typically displayed on the title bar of the web page on the user's browser window. Process 302 can detect information of interest from other code for the web page such as the text in an “h1” header tag, which is an HTML element for the first-level heading of a document, the layout of a web page, and/or text rendered in large or bold fonts on a web page. Process 302 can also distinguish between code of interest on the web pages and code that is “uninteresting”, for example, framing information on the web pages such as ads, comments, or links to other products that are typically not of interest and exclude such code from the text to be checked. The web pages can be generated by web sites that are unaffiliated with or independent of one another.
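Extraction of the “title” and “h1” text in process 302 can be sketched with the HTML parser in the Python standard library. The class and function names are hypothetical, and the sketch simply ignores all other page text rather than modeling ads, comments, or layout.

```python
from html.parser import HTMLParser

class InterestingTextParser(HTMLParser):
    """Collects the text found inside <title> and <h1> elements."""
    TAGS = {"title", "h1"}

    def __init__(self):
        super().__init__()
        self._open = None   # tag currently being collected, if any
        self.found = []     # list of (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self._open = tag
            self.found.append((tag, ""))

    def handle_data(self, data):
        if self._open:
            tag, text = self.found[-1]
            self.found[-1] = (tag, text + data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

def extract_candidates(page_source):
    """Return (tag, text) pairs to be checked against the target set."""
    parser = InterestingTextParser()
    parser.feed(page_source)
    parser.close()
    return [(tag, " ".join(text.split())) for tag, text in parser.found]
```

Text outside these two elements, such as a paragraph of advertising copy, never reaches the matching step in this sketch.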
Once text to be checked has been extracted, process 304 determines whether the text matches any terms (also referred to as “targets”) in target set 114 for the web site. Matches may be detected immediately or multiple strings may be stored for later processing, either at a given time or when the host computer running process 304 has available processing cycles. The processing may take place on the user's workstation, and/or the strings may be transmitted (one at a time or in batch) to a remote location for processing.
To detect matching targets, substrings of the title can be checked. In some embodiments, possible subsets of contiguous words in the string are considered after removing stop words. In other embodiments, a maximum string length may be imposed, either the longest naturally-occurring target in target set 114 or an explicitly-imposed bound, e.g. all subsequences up to 12 words long. Other techniques further described herein may be used to narrow the number of substrings checked for matches.
If target sets 114 are stored in a database, each remaining substring can result in a database query. As an alternative, a compact representation of the hash codes of the target strings can be maintained and the hash of each substring can be computed. If the hash is stored in the data structure, process 304 can conclude that the substring matches a target in target set 114, although the match may be a false positive result. Typically there is a trade-off between the false positive rate and the amount of space the data structure consumes, as well as the amount of time required to determine that a substring is or is not contained in the target set.
In some embodiments, target set 114 can be maintained as a sorted array of 40-bit (i.e., 5-byte) hash codes. Smaller hash codes (e.g., 4-byte (32-bit) hash codes) can be used but may have a higher false positive rate of reported matches. Larger hash codes (e.g., 6 bytes) can provide a significant improvement and allow increased amounts of data in the target sets while maintaining a low false positive rate. Although the number of bytes used to store hash codes can be extended indefinitely, the hash computation becomes more expensive when more than 8 bytes are used, and for any extension the amount of space consumed increases. To process a string, the string can first be converted to an array of hashes corresponding to each word in the string with stop words removed. Stop words can be detected and removed using a table of stop word hash codes. Then for each possible starting position in the array, a hash code is computed for each possible subsequence of hash codes in the array by successively combining hashes by a shift and XOR. Each subsequent candidate hash corresponding to a normalized substring of the title can be compared to entries in target set 114 to determine whether there is a match.
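The steps above can be sketched as follows. Several details are assumptions made for illustration: BLAKE2b stands in for the word hash, the combining step is a 7-bit rotation within 40 bits followed by XOR (the text specifies only "a shift and XOR"), the 12-word bound follows the example given earlier, and all function names are hypothetical.

```python
import bisect
import hashlib

MASK40 = (1 << 40) - 1

def word_hash(word):
    """40-bit hash code of one word (illustrative stand-in hash)."""
    digest = hashlib.blake2b(word.encode("utf-8"), digest_size=5).digest()
    return int.from_bytes(digest, "big")

def combine(acc, h):
    """Fold the next word hash into a running phrase hash."""
    rotated = ((acc << 7) | (acc >> 33)) & MASK40  # assumed shift amount
    return rotated ^ h

def to_hashes(words, stop_hashes):
    """Word hashes of a string, with stop words dropped by hash lookup."""
    return [h for h in map(word_hash, words) if h not in stop_hashes]

def fold(hashes):
    acc = 0
    for h in hashes:
        acc = combine(acc, h)
    return acc

def build_table(target_phrases, stop_hashes):
    """Sorted array of phrase hash codes representing the target set."""
    return sorted({fold(to_hashes(p, stop_hashes)) for p in target_phrases})

def find_matches(title_words, table, stop_hashes, max_len=12):
    """Return (start, end) positions, inclusive, in the stop-word-free
    hash array whose combined hash code appears in the table."""
    hashes = to_hashes(title_words, stop_hashes)
    matches = []
    for start in range(len(hashes)):
        acc = 0
        for end in range(start, min(start + max_len, len(hashes))):
            acc = combine(acc, hashes[end])  # extend the subsequence by one word
            pos = bisect.bisect_left(table, acc)
            if pos < len(table) and table[pos] == acc:
                matches.append((start, end))
    return matches
```

Extending the running hash one word at a time means each subsequence costs a single combine step rather than a rehash of all of its words. The lookup here uses an ordinary binary search from the standard library.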
The lookup table for target set 114 may be kept as a dense, sorted array of hashes and the lookups performed by means of a proportional variant of a binary search. That is, rather than choosing a probe point in the midpoint of the active section of the table, process 304 can make use of the fact that the hash routine can produce essentially uniformly-random numbers to choose as the probe point according to the following Equation 1:
Process 304 then iterates, setting “max” or “min” to just beyond the probe point, until either the value at the probe point matches the target or max and min collide. Using Equation 1, a match is detected (or noted to not be detected) by iteratively probing the target set with the probe point based on the magnitude of a candidate hash code (h(t)). Empirically, Equation 1 can require an average of approximately four probes to find a match and another half probe on average to find a miss for tables consisting of millions of entries.
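A proportional search of this kind can be sketched as follows. The probe formula used here, a linear interpolation between the values stored at “min” and “max”, is a common choice and is an assumption; it is not necessarily the exact form of Equation 1. The function name is hypothetical.

```python
def proportional_search(table, target):
    """Find `target` in a sorted list of hash codes, or return -1.

    The probe point is placed in proportion to where the target value
    falls between the values at the ends of the active section, which
    suits hash codes that are close to uniformly distributed.
    """
    lo, hi = 0, len(table) - 1
    while lo <= hi:
        lo_val, hi_val = table[lo], table[hi]
        if target < lo_val or target > hi_val:
            return -1                      # outside the active section
        if hi_val == lo_val:
            probe = lo
        else:
            probe = lo + (target - lo_val) * (hi - lo) // (hi_val - lo_val)
        value = table[probe]
        if value == target:
            return probe
        if value < target:
            lo = probe + 1                 # move "min" just past the probe
        else:
            hi = probe - 1                 # move "max" just before the probe
    return -1
```

On uniformly distributed values the probe usually lands close to the target, so far fewer probes are needed than the roughly twenty a midpoint binary search would take on a table of a million entries.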
Other representations of target set 114 can be used including normal hash tables.
Once a match has been found as being, for example, from word 4 to word 7 of the title, the substring of the title that the match corresponds to is noted. The substring can be normalized or raw including all ignored symbols and stop words. After the title is processed, the set of matches may be reduced to a subset of “best matches”, a single best match, or a set of “good enough matches” by considering the number of words matched and the length of the string. In some embodiments, there can be metadata that indicates that some targets are more important than others, and this information can be used in making the decision of whether a match has been found.
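One simple way to reduce the set of matches is to keep only those that are not contained in a longer match. The sketch below is one possible heuristic under that assumption; the function name is hypothetical, and matches are represented as (start, end) word positions, inclusive.

```python
def maximal_matches(matches):
    """Keep only matches that are not nested inside another match."""
    kept = []
    # Sort by start, longest first, so any containing match is seen earlier.
    for match in sorted(set(matches), key=lambda m: (m[0], -m[1])):
        if not any(k[0] <= match[0] and match[1] <= k[1] for k in kept):
            kept.append(match)
    return kept
```

A match covering words 3 through 6 therefore suppresses shorter matches covering words 3 through 4 or 4 through 5, while an unrelated match elsewhere in the title is retained. Metadata about target importance, where available, could be folded into the ordering.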
Based on the subset of matches for a particular web site, and matches found as the user visits other web sites, process 306 can generate or modify a profile 124 (
Process 300 can obtain further texts and detect matches with further targets associated with further potentially interesting items associated with further categories. A subset of categories to include in the first profile can be generated based on the further matches. For example, process 300 can determine that a first item is associated with several categories (e.g., “electronics”, “home theater”, “televisions”, “LCD televisions”, “40-inch televisions”, “1080p televisions”, “items costing between $700 and $1,000”) and the other items are associated with other categories, where there may be overlap between the categories associated with different items. From the set of categories, process 300 can determine that the appropriate categories to describe the user's interest are “LCD televisions” and “1080p televisions”, as other televisions viewed may be in different size and price ranges. Perhaps based on the other items, process 300 might narrow the categories to “40-inch televisions” and “42-inch televisions”. Such a technique can be used to categorize and subcategorize other types of items, such as information on books the user has viewed.
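The narrowing of categories can be illustrated with a minimal sketch that keeps the most specific categories shared by all items viewed. The parent map describing the category hierarchy and the function names are assumptions for illustration; a real embodiment might weigh categories rather than require that every item share them.

```python
def ancestors(category, parent):
    """All categories above `category` in the hierarchy."""
    seen = set()
    while category in parent and parent[category] not in seen:
        category = parent[category]
        seen.add(category)
    return seen

def describe_interest(item_categories, parent):
    """Most specific categories common to every viewed item."""
    if not item_categories:
        return set()
    shared = set.intersection(*(set(cats) for cats in item_categories))
    covered = set()
    for category in shared:
        covered |= ancestors(category, parent)  # broader than a shared category
    return shared - covered
```

For two televisions that differ only in screen size, the shared categories "electronics" and "televisions" are dropped as too general, leaving "LCD televisions" and "1080p televisions".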
In some embodiments, a first profile identifies products, services, organizations, subjects, and/or content objects of interest to the user. A second user profile can be used to identify a category based on the user's first profile. For example, the first profile can be used to determine the specific number and type of item a user has viewed whereas the second profile can be used to provide more general information about the user's interest.
Process 306 may be performed on the user's workstation 102 (
One goal of customization logic 112 is to identify products that a user is actively shopping for, in addition to noting that the user has viewed pages that deal with the product (e.g., retail store pages or review sites). It may also be desirable to be able to infer that the user has not already purchased the product or a substitute for the product. Process 306 can allow the inferred interest in a particular product to decay over time, since if there is a flurry of shopping behavior for a particular product or product category and then the activity ceases, it can be inferred that the user is no longer interested in buying that product. However, especially for recent interest, it may be desirable to notice when a user makes an on-line purchase. Purchases can be detected, among other ways, by noting that the user has followed a link whose accompanying text or URL indicates a purpose to add something to a shopping cart or that the resulting page indicates that something has been added or immediately purchased. In such cases, the product matched on the immediately preceding page can be inferred to have been purchased, and the product matched and any products that appear to be similar should probably be removed from the profile. Removing the product from the profile can prevent continually offering advertisements or other information for products/services that were once of interest but are no longer needed. Alternately, the product/service can be marked as ‘recently purchased’ and accessories appropriate to the product may be promoted by participating online stores.
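The decay of inferred interest can be sketched as a score in which each noted match counts for less as it ages. Exponential decay with a 14-day half-life and a pruning threshold of 0.1 are assumptions chosen for illustration; the text does not specify a decay function. Timestamps are in seconds.

```python
SECONDS_PER_DAY = 86400

def interest_score(sightings, now, half_life_days=14.0):
    """Sum of per-match weights, each halving every `half_life_days`."""
    return sum(
        0.5 ** ((now - seen) / SECONDS_PER_DAY / half_life_days)
        for seen in sightings
        if seen <= now
    )

def prune(profile, now, threshold=0.1, half_life_days=14.0):
    """Drop products whose decayed interest has fallen below threshold.

    `profile` maps a product name to the list of times it was matched.
    """
    return {
        product: sightings
        for product, sightings in profile.items()
        if interest_score(sightings, now, half_life_days) >= threshold
    }
```

The score reflects both the number of matches and the time elapsed since each one, so a flurry of recent views outweighs a single view from months earlier.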
In some embodiments, process 306 may not only note that a product has been viewed but also try to extract the price of the product from the viewed web page. Price information can be used in several ways. First, the store web site being customized may be able to dynamically modify its offering price based on the knowledge of competing prices the customer has seen, even if it is not revealed where the customer saw them. Second, a web site may refrain from showing the user a product that it carries if it knows that the user has already seen a better price. Third, the knowledge of the prices seen for products in a category may allow the web site to decide which products in a category to display.
In some cases, the same target may appear associated with multiple disjoint categories. For example, the same product name may refer to a book, a movie, a CD, and a video game, and other coordinated merchandising. When such a product is matched, process 306 may infer the product category from the web site based on the name of the web site and/or by other product matches on the web site. This is one instance in which it might be useful to try to extract matches from the entire web page, including other products recommended. Note that if category information is not being used, this is needless—the product name is simply the product name.
Process 306 can also allow the user to control their profile. Examples of such control might include (1) a “pause button”, to allow a user to indicate that the products or web pages viewed should not be included in the user's profile, (2) an option to exclude products in certain categories or matching certain patterns, and/or (3) the ability for a user to view their profile and explicitly remove items, optionally with the further ability to permanently prevent the product from being added to the profile in future browsing.
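The three user controls can be sketched with a small profile object. The class name UserProfile and its fields are hypothetical, and the sketch covers category exclusion only, not pattern matching.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    items: dict = field(default_factory=dict)               # product -> category
    paused: bool = False                                     # the "pause button"
    excluded_categories: set = field(default_factory=set)
    blocked: set = field(default_factory=set)                # never add again

    def note_view(self, product, category):
        """Record a viewed product unless a user control forbids it."""
        if (self.paused
                or category in self.excluded_categories
                or product in self.blocked):
            return False
        self.items[product] = category
        return True

    def remove(self, product, permanently=False):
        """Remove an item, optionally preventing it from being re-added."""
        self.items.pop(product, None)
        if permanently:
            self.blocked.add(product)
```

While paused, views are silently ignored, and a permanently removed product is skipped in future browsing even if it is matched again.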
Customizing a Web Site
Referring now to
The profile can take many forms. In some embodiments, the profile includes strings that were matched, perhaps in normalized form. In such a case, the web site can use its own search facility to identify likely products that match those strings. In cases in which the profile can associate targets with categories, process 404 can prune the profile to matches of targets in categories for which the web site has products. Although some examples herein pertain to the web site being a store, similar features can be included when the web site is a product review or product news site or any other web site that may have product-related content. For sites that have products in multiple categories, process 404 can identify the categories of the targets matched to prevent spurious recommendations of unrelated products, such as, for example, a book whose title happens to match the name of a car being viewed. However, some matches in different categories may be appropriate to present, such as merchandise related to a particular product such as a toy, book, CD, or DVD.
In some embodiments, rather than including the strings, process 404 can include the web site's own product identifiers and categories in the profile, as provided by the web site when the target set 114 (
Process 406 can use any of various techniques to customize the web page. In some embodiments, process 406 can take the form of altering the set of products proffered as recommendations on the initial page the user sees or kept in a side-bar on pages as the user browses. In some cases, specific products are not recommended but the user is immediately directed to a web page for the relevant “department” of the web site. Alternatively, navigation to the relevant department is made more visually obvious.
In some embodiments the web site is the provider of the content used to customize the web page. In alternative embodiments, the web site can request personalized content from an external content provider. In some such embodiments, the web site may forward the profile to the external content provider. In other such embodiments, the web site may forward the identifier for the user to the external content provider and the external content provider may use the identifier to obtain the profile from the profile server.
In some cases, the web site may want to be more proactive in drawing users. In such cases, process 406 may display information when it is inferred that the user is looking at a product page. Such information can include a link to the product (or category) on the store's web site and may include pricing information. The display may appear in a pop-up window or a stable section of the web page display. The information can also be provided in an RSS feed, an e-mail advertisement, or other suitable communication to the user. Accordingly, process 406 can include allowing users to customize the information that is sent to them, such as allowed sources, contact means, and content.
The various functions, processes, methods, and operations performed or executed by the system can be implemented as programs that are executable on various types of processors, controllers, central processing units, microprocessors, digital signal processors, state machines, programmable logic arrays, and the like. The programs can be stored on any computer-readable medium for use by or in connection with any computer-related system or method. A computer-readable medium is an electronic, magnetic, optical, or other physical device or means that can contain or store a computer program for use by or in connection with a computer-related system, method, process, or procedure. Programs and logic instructions can be embodied in a computer-readable storage medium or device for use by or in connection with an instruction execution system, device, component, element, or apparatus, such as a system based on a computer or processor, or other system that can fetch instructions from an instruction memory or storage of any appropriate type.
In
Additionally, workstations 102, processing unit 106, and servers 118 can be embodied in any suitable computing device, and so include servers, personal data assistants (PDAs), telephones with display areas, network appliances, desktops, laptops, or other computing devices. Workstations 102, processing unit 106, servers 118, and corresponding logic instructions can be implemented using any suitable combination of hardware, software, and/or firmware, such as microprocessors, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuit (ASICs), or other suitable devices.
Workstations 102, processing unit 106, and servers 118 can include memory devices 108, although memory device 108 is only shown in processing unit 106. Logic instructions executed by workstations 102, processing unit 106, and servers 118 can be stored on a computer readable storage medium or devices 108, or accessed by workstations 102, processing unit 106, and servers 118 in the form of electronic signals. Workstations 102, processing unit 106, and servers 118 can be configured to interface with each other, and to connect to external network 104 via suitable communication links such as any one or combination of T1, ISDN, or cable line, a wireless connection through a cellular or satellite network, or a local data transport system such as Ethernet or token ring over a local area network. Memory devices 108 can be implemented using one or more suitable built-in or portable computer memory devices such as dynamic or static random access memory (RAM), read only memory (ROM), cache, flash memory, and memory sticks, among others. Memory device(s) 108 can store data and/or execute customization logic 112, target sets 114, profiles 124, browser program 116, product/service strings 120, and information associated with web pages 122.
The illustrative block diagrams and flow charts depict process steps or blocks that may represent modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps in the process. Although the particular examples illustrate specific process steps or acts, many alternative implementations are possible and commonly made by simple design choice. Acts and steps may be executed in different order from the specific description herein, based on considerations of function, purpose, conformance to standard, legacy structure, and the like.
While the present disclosure describes various embodiments, these embodiments are to be understood as illustrative and do not limit the claim scope. Many variations, modifications, additions and improvements of the described embodiments are possible. For example, those having ordinary skill in the art will readily implement the steps necessary to provide the structures and methods disclosed herein, and will understand that the process parameters, materials, and dimensions are given by way of example only. The parameters, materials, and dimensions can be varied to achieve the desired structure as well as modifications, which are within the scope of the claims. Variations and modifications of the embodiments disclosed herein may also be made while remaining within the scope of the following claims. The illustrative techniques may be used with any suitable data center configuration and with any suitable servers, computers, and devices.