The present invention relates to the categorization of content in general, and more particularly to the categorization of computer network-based content.
The Internet's vast array of web sites and enormous pools of information can overwhelm a typical web surfer. While each web site may attempt to cater its services to a specific clientele, a web surfer interested in a particular set of services might not know in advance which web site will provide the services he is interested in. Search engines, such as Yahoo™, provide one mechanism to enable web surfers to limit and focus their browsing to a subset of web sites. The information available on the web is typically organized and categorized by the search engines and stored on the search engine's web server.
Unfortunately, this reliance on search engines limits a web surfer's choices to web sites monitored by the search engine and requires the web surfer to accept the search engine's categorization of web sites. Web sites that are not known to a search engine or not categorized in a way that the web surfer expects may never be found.
Categorization of web pages is a multi-faceted science. Content-based search engines, such as Google™, extract keywords from web pages and enable searches of these keywords. Category-based search engines, such as Yahoo™, organize web sites into categories, often after much manual manipulation by search engine managers.
The content currently displayed by the browser is perhaps the best indication of what a web surfer is searching for. While search engines provide a context for the content, web surfers who directly access a service provider's web site have no contextual information. A web surfer may like what he sees but be unable to find similar web sites.
The present invention discloses a system and method for categorizing computer network-based content, such as web pages.
In one aspect of the present invention a method is provided for content categorization, the method including firstly retrieving content from a first content source from among a categorized list of content sources, extracting a plurality of words from the firstly retrieved content, associating any of the words with a category to which the firstly retrieved content is associated in the categorized list, secondly retrieving content from a second content source independently from the categorized list of content sources, extracting a plurality of words from the secondly retrieved content, and associating the secondly retrieved content with the category where any of the words in the secondly retrieved content matches any of the words in the firstly retrieved content, where the match is in accordance with a predefined heuristic.
In another aspect of the present invention the method further includes constructing an occurrence table relating each of a plurality of structures of the firstly retrieved content with any unique occurrences of any of the words in the firstly retrieved content which appear within the structure and a number of the occurrences thereof.
In another aspect of the present invention the method further includes removing predefined ones of the words in the firstly retrieved content from the occurrence table.
In another aspect of the present invention the method further includes removing predefined common articles of language.
In another aspect of the present invention the first associating step includes constructing a word relationship table from the associations of the words in the firstly retrieved content and the category.
In another aspect of the present invention the method further includes maintaining the association with the category as part of a hierarchy of a plurality of categories.
In another aspect of the present invention any of the steps are performed by a server.
In another aspect of the present invention any of the steps are performed by a client.
In another aspect of the present invention a method is provided for content categorization, the method including retrieving content from a content source, extracting a plurality of words from the retrieved content, and associating the retrieved content with a category where any of the words in the retrieved content matches any word in a group of words previously associated with the category, where the match is in accordance with a predefined heuristic.
In another aspect of the present invention the method further includes presenting information relating to the category via a user interface. In another aspect of the present invention the method further includes presenting the category within a window on a display of a computer which retrieved the content.
In another aspect of the present invention the method further includes presenting a parent category of the category within a window on a display of a computer which retrieved the content.
In another aspect of the present invention either of the extracting and associating steps includes applying the heuristic to a first portion of the content, and thereafter applying the heuristic to a second portion of the content where no category match is found for the first portion.
In another aspect of the present invention the associating step includes associating the retrieved content with a plurality of categories, and selecting one of the categories having the most letters.
In another aspect of the present invention the associating step includes associating the retrieved content with a plurality of categories, and selecting one of the categories having the greatest descriptive measure in accordance with a predefined measure per category.
In another aspect of the present invention the method further includes querying a second content source using one or more words associated with either of the category and the retrieved content, receiving from the second content source in response to the query one or more links to content, presenting any of the links for selection by a user, and providing access to content indicated by any of the links upon selection of the link.
In another aspect of the present invention any of the steps are performed by a client.
In another aspect of the present invention a method is provided for server-side categorization of content, the method including receiving at a server a request from a client for content from the server, extracting a plurality of words from the retrieved content, associating the retrieved content with a category where any of the words in the retrieved content matches any word in a group of words previously associated with the category, where the match is in accordance with a predefined heuristic, and modifying the content in accordance with a predefined modification associated with the category.
In another aspect of the present invention the modifying step includes inserting into the content an advertisement associated with the category.
In another aspect of the present invention the method further includes selecting one category from among a plurality of the categories associated with the requested content in accordance with a function of the expected value of the categories.
In another aspect of the present invention the selecting step includes selecting the category for which the click-thru rate for advertisements associated with the category is greatest.
In another aspect of the present invention the method further includes selecting one category from among a plurality of the categories associated with the requested content in accordance with a predefined selection preference order of the categories.
In another aspect of the present invention the method further includes selecting one category from among a plurality of the categories associated with the requested content in accordance with a combined selection heuristic based on a function of the expected value of the categories and a predefined selection preference order of the categories.
In another aspect of the present invention a system is provided for content categorization, the system including means for firstly retrieving content from a first content source from among a categorized list of content sources, means for extracting a plurality of words from the firstly retrieved content, means for associating any of the words with a category to which the firstly retrieved content is associated in the categorized list, means for secondly retrieving content from a second content source independently from the categorized list of content sources, means for extracting a plurality of words from the secondly retrieved content, and means for associating the secondly retrieved content with the category where any of the words in the secondly retrieved content matches any of the words in the firstly retrieved content, where the match is in accordance with a predefined heuristic.
In another aspect of the present invention the system further includes an occurrence table relating each of a plurality of structures of the firstly retrieved content with any unique occurrences of any of the words in the firstly retrieved content which appear within the structure and a number of the occurrences thereof.
In another aspect of the present invention the system further includes means for removing predefined ones of the words in the firstly retrieved content from the occurrence table.
In another aspect of the present invention the system further includes means for removing predefined common articles of language.
In another aspect of the present invention the system further includes a word relationship table including the associations of the words in the firstly retrieved content and the category.
In another aspect of the present invention the association with the category is part of a hierarchy of a plurality of categories.
In another aspect of the present invention any of the means are embodied in a server.
In another aspect of the present invention any of the means are embodied in a client.
In another aspect of the present invention a system is provided for content categorization, the system including means for retrieving content from a content source, means for extracting a plurality of words from the retrieved content, and means for associating the retrieved content with a category where any of the words in the retrieved content matches any word in a group of words previously associated with the category, where the match is in accordance with a predefined heuristic.
In another aspect of the present invention the system further includes means for presenting information relating to the category via a user interface. In another aspect of the present invention the system further includes means for presenting the category within a window on a display of a computer which retrieved the content.
In another aspect of the present invention the system further includes means for presenting a parent category of the category within a window on a display of a computer which retrieved the content.
In another aspect of the present invention either of the extracting and associating means are operative to apply the heuristic to a first portion of the content, and thereafter apply the heuristic to a second portion of the content where no category match is found for the first portion.
In another aspect of the present invention the means for associating is operative to associate the retrieved content with a plurality of categories, and select one of the categories having the most letters.
In another aspect of the present invention the means for associating is operative to associate the retrieved content with a plurality of categories, and select one of the categories having the greatest descriptive measure in accordance with a predefined measure per category.
In another aspect of the present invention the system further includes means for querying a second content source using one or more words associated with either of the category and the retrieved content, means for receiving from the second content source in response to the query one or more links to content, means for presenting any of the links for selection by a user, and means for providing access to content indicated by any of the links upon selection of the link.
In another aspect of the present invention any of the means are embodied in a client.
In another aspect of the present invention a system is provided for server-side categorization of content, the system including means for receiving at a server a request from a client for content from the server, means for extracting a plurality of words from the retrieved content, means for associating the retrieved content with a category where any of the words in the retrieved content matches any word in a group of words previously associated with the category, where the match is in accordance with a predefined heuristic, and means for modifying the content in accordance with a predefined modification associated with the category.
In another aspect of the present invention the means for modifying is operative to insert into the content an advertisement associated with the category.
In another aspect of the present invention the system further includes means for selecting one category from among a plurality of the categories associated with the requested content in accordance with a function of the expected value of the categories.
In another aspect of the present invention the means for selecting is operative to select the category for which the click-thru rate for advertisements associated with the category is greatest.
In another aspect of the present invention the system further includes means for selecting one category from among a plurality of the categories associated with the requested content in accordance with a predefined selection preference order of the categories.
In another aspect of the present invention the system further includes means for selecting one category from among a plurality of the categories associated with the requested content in accordance with a combined selection heuristic based on a function of the expected value of the categories and a predefined selection preference order of the categories.
The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the appended drawings in which:
Reference is now made to
Categorization server 100 preferably extracts the words from the retrieved content and constructs an occurrence table 170, shown in
Categorization server 100 preferably edits occurrence table 170 to remove spurious information, such as common articles of language, e.g. ‘is’, and constructs a word relationship table, such as is shown in table A below, associating words in occurrence table 170 with their respective category, such as the category under which the retrieved content is categorized as indicated by one or more of the categorized lists provided by one or more search engines. Once a word has been associated with a category, it may be used to indicate that other content, even content that has not been categorized by a search engine, may belong to the same category. For example, as per table A, an HTML document whose URL includes the word ‘DVD’, such as in ‘www.dvdguys.com’, may be considered to belong to the category ‘electronics’ based on the existing association between the word ‘DVD’ and the category ‘electronics’.
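The association between words and categories described above may be illustrated by the following minimal Python sketch. It is not the implementation of categorization server 100. The stop-word list, the function names, the use of a 'title' and a 'body' as the structures of the occurrence table, and the simple substring test applied to the URL are all assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict

# Hypothetical list of words to remove; the text gives only 'is' as an example.
STOP_WORDS = {"is", "a", "an", "the", "of", "and"}


def extract_words(text):
    """Return the lower-cased alphanumeric tokens found in the text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_occurrence_table(structures):
    """Map each structure name (e.g. 'title') to its word counts.

    Stop words are removed so that they never reach the
    word relationship table.
    """
    table = {}
    for name, text in structures.items():
        counts = Counter(extract_words(text))
        for word in STOP_WORDS:
            counts.pop(word, None)
        table[name] = counts
    return table


def add_to_relationship_table(relationship_table, occurrence_table, category):
    """Associate every remaining word with the category of its source page."""
    for counts in occurrence_table.values():
        for word in counts:
            relationship_table[word].add(category)


def categories_for_url(relationship_table, url):
    """Return the categories of all known words that appear inside the URL.

    Substring matching is an assumed stand-in for the 'predefined
    heuristic'; very short words could match spuriously.
    """
    url = url.lower()
    found = set()
    for word, categories in relationship_table.items():
        if word in url:
            found |= categories
    return found


# A page already listed under 'electronics' in a categorized list.
page = {"title": "DVD players", "body": "The DVD player is on sale"}
relationship_table = defaultdict(set)
add_to_relationship_table(
    relationship_table, build_occurrence_table(page), "electronics"
)

# An uncategorized site is then matched through the word 'dvd'.
print(categories_for_url(relationship_table, "www.dvdguys.com"))
```

In this sketch a word may accumulate several categories, which is one possible way of representing a word relationship table and is not stated in the text.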
Elements of Table A are defined as follows:
Reference is now made to
Categorizer 220 constructs occurrence table 170 as described hereinabove with reference to
The current document is said to belong to a particular category where:
Categorizer 220 is preferably implemented to optimize the processing time necessary to match occurrence table 170 with word relationship table 180. For example, categorizer 220 may first apply heuristics to the content title, found early in a web page, and continue to apply heuristics to the body only if the title heuristics are inconclusive, i.e. occurrence table 170 does not match any category in word relationship table 180 following the title heuristics.
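The title-first optimization described above may be sketched as follows. This is an illustrative Python sketch only. The contents of the word-to-category mapping, the function names, and the exact-word matching rule are assumptions and do not reflect the actual heuristics of categorizer 220.

```python
import re

# Hypothetical excerpt of a word relationship table: word -> category.
WORD_TO_CATEGORY = {
    "dvd": "electronics",
    "camera": "photography",
    "lens": "photography",
}


def match_categories(text, word_to_category):
    """Return the categories of all known words that occur in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word_to_category[w] for w in words if w in word_to_category}


def categorize(title, body, word_to_category=WORD_TO_CATEGORY):
    """Apply the matching to the title first and fall back to the body.

    Returns (categories, portion), where portion names the part of the
    document that produced the match, or None when nothing matched.
    """
    title_matches = match_categories(title, word_to_category)
    if title_matches:
        # The title was conclusive, so the longer body is never scanned.
        return title_matches, "title"
    body_matches = match_categories(body, word_to_category)
    return body_matches, ("body" if body_matches else None)


print(categorize("Cheap DVD deals", "camera reviews"))
print(categorize("Welcome", "Our camera and lens reviews"))
```

Because the body is examined only when the title yields no category, the cost of processing a typical page is dominated by the short title.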
Word relationship table 180 may include multiple descriptions of a category. Categorizer 220 preferably extracts from word relationship table 180 the most descriptive words of a category to present to client 200, as described hereinbelow. In one methodology, the length of a word may be utilized to determine the descriptive nature of a word without manual intervention. Categorizer 220 preferably chooses the word with the most letters, i.e. longest word, as the most descriptive word. In an alternate methodology, categorizer 220 may refer to a measure of the descriptive characteristics of each word in the word relationship table 180 that is entered manually.
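Both methodologies for choosing the most descriptive word may be sketched in a few lines of Python. The function name and the form of the manually entered measure (a mapping from word to numeric score) are assumptions made for illustration.

```python
def most_descriptive_word(words, manual_scores=None):
    """Pick the most descriptive word from a non-empty list of words.

    With manual_scores (assumed form: word -> number) the highest-scoring
    word wins; otherwise the longest word wins. Ties go to the word that
    appears first in the list.
    """
    if manual_scores:
        return max(words, key=lambda w: manual_scores.get(w, 0))
    return max(words, key=len)


descriptions = ["dvd", "electronics", "video"]

# Length-based choice, no manual intervention.
print(most_descriptive_word(descriptions))

# Manually entered measure overriding word length.
print(most_descriptive_word(descriptions, {"dvd": 5, "electronics": 2}))
```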
Categorizer 220 may present information related to the category or categories found to correspond to the current document in browser 210, such as the category name, via a user interface, such as a computer display or speaker. Categorizer 220 preferably employs a button bar assistant 230 as shown in
Categorizer 220 may create a set of keywords based on the information and associated words found to correspond to the current document in browser 210 and search external sources, such as commercial web sites, for links to further information that are typically associated with the keywords. For example, the current document in browser 210 as shown in
Reference is now made to
Categorizer 220 may define the single best category for a requested document as a function of the expected value of the category. For example, where client 200 requests a document from amazon.com™ that describes a Nikon™ camera, categorizer 220 may determine that the top three appropriate categories in order of relevance, as defined through heuristics employed to match occurrence table 170, constructed for the document retrieved from amazon.com™, with word relationship table 180, are ‘camera,’ ‘digital camera’ and ‘lens.’ Categorizer 220 may then analyze the value of each category as a function of the click-thru rate of the advertisements for each category, where advertising click-thru rates and the associations between advertisements and categories may be provided to categorizer 220 from any source using conventional techniques. If, historically, lens advertisements (i.e., advertisements that are of the ‘lens’ category) are clicked on more often than camera or digital camera advertisements, categorizer 220 may inform content server 120 that the category ‘lens’ is the single best category for the requested document.
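Selection by expected value may be illustrated by the following Python sketch, in which the expected value of a category is taken to be simply its historical click-thru rate. The function name and the numeric rates are hypothetical and are not data from any actual advertising source.

```python
def best_category_by_expected_value(candidates, click_thru_rates):
    """Return the candidate category whose advertisements are clicked most.

    Categories with no recorded rate are treated as having a rate of 0.
    """
    return max(candidates, key=lambda c: click_thru_rates.get(c, 0.0))


# Candidate categories in order of relevance, as in the example above.
candidates = ["camera", "digital camera", "lens"]

# Hypothetical historical click-thru rates per category.
rates = {"camera": 0.010, "digital camera": 0.012, "lens": 0.020}

print(best_category_by_expected_value(candidates, rates))  # prints: lens
```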
Alternatively, a single best category may be selected based on a predefined category selection heuristic. For example, preference may be given to the category appearing in the document title, followed by the category appearing in the document body. Thus, in the above example, if the category ‘camera’ appears in the document title, it may be selected as the single best category for the document even if the category ‘digital camera’ appears in the body. This selection method may be combined with selection by expected value described above in accordance with a predefined heuristic. For example, if by the selection preference method ‘camera’ should be selected over ‘digital camera’, a combined selection heuristic might give preference to non-selected category ‘digital camera’ if its click-thru rate is twice that of the selected category ‘camera.’
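The preference-order selection and its combination with expected value may be sketched as follows. This Python sketch is illustrative only. The function names, the default factor of 2.0, and the reading of 'twice' as 'at least twice' are assumptions.

```python
def select_by_preference(title_categories, body_categories):
    """Prefer a category found in the title, then one found in the body."""
    if title_categories:
        return title_categories[0]
    if body_categories:
        return body_categories[0]
    return None


def select_combined(title_categories, body_categories, click_thru_rates,
                    factor=2.0):
    """Start from the preferred category, then allow an override.

    A non-selected category replaces the preferred one only when its
    click-thru rate is at least `factor` times that of the preferred
    category. Among several such categories the highest rate wins.
    """
    preferred = select_by_preference(title_categories, body_categories)
    if preferred is None:
        return None
    preferred_rate = click_thru_rates.get(preferred, 0.0)
    best, best_rate = preferred, preferred_rate
    for candidate in list(title_categories) + list(body_categories):
        rate = click_thru_rates.get(candidate, 0.0)
        if rate >= factor * preferred_rate and rate > best_rate:
            best, best_rate = candidate, rate
    return best


# 'camera' is in the title, 'digital camera' only in the body.
# The rates below are hypothetical.
print(select_combined(["camera"], ["digital camera"],
                      {"camera": 0.010, "digital camera": 0.030}))
print(select_combined(["camera"], ["digital camera"],
                      {"camera": 0.010, "digital camera": 0.015}))
```

With the first set of rates the override applies and 'digital camera' is selected; with the second set the preferred category 'camera' is kept.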
Once categorizer 220 determines the single category, or the single best category, for the requested content, content server 120 preferably utilizes the information provided by categorizer 220 to modify the document requested by client 200. For example, the document requested may include a placeholder for an advertisement. Content server 120 preferably modifies the document by removing the placeholder and inserting an advertisement associated with the selected category, such as an advertisement for camera lenses in the above example, obtained from any source of advertisements using conventional techniques.
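The document modification described above may be sketched in Python as follows. The placeholder token, the advertisement markup, and the function name are hypothetical, since the text does not specify the form of the placeholder or the source of the advertisement.

```python
# Hypothetical placeholder token embedded in the requested document.
AD_PLACEHOLDER = "<!-- AD_PLACEHOLDER -->"

# Hypothetical mapping from category to advertisement markup.
ADS_BY_CATEGORY = {
    "lens": '<div class="ad">Camera lenses on sale</div>',
}


def insert_advertisement(document, category, ads=ADS_BY_CATEGORY,
                         placeholder=AD_PLACEHOLDER):
    """Replace the first placeholder with the advertisement for the category.

    The document is returned unchanged when no advertisement is known
    for the category (an assumed fallback, not stated in the text).
    """
    ad = ads.get(category)
    if ad is None:
        return document
    return document.replace(placeholder, ad, 1)


page = "<html><body><h1>Nikon camera</h1><!-- AD_PLACEHOLDER --></body></html>"
print(insert_advertisement(page, "lens"))
```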
It is appreciated that one or more of the steps of any of the methods described herein may be omitted or carried out in a different order than that shown, without departing from the true spirit and scope of the invention.
While the methods and apparatus disclosed herein may or may not have been described with reference to specific computer hardware or software, it is appreciated that the methods and apparatus described herein may be readily implemented in computer hardware or software using conventional techniques.
While the present invention has been described with reference to one or more specific embodiments, the description is intended to be illustrative of the invention as a whole and is not to be construed as limiting the invention to the embodiments shown. It is appreciated that various modifications may occur to those skilled in the art that, while not specifically shown herein, are nevertheless within the true spirit and scope of the invention. Thus, the present invention need not be limited to the field of advertising, but may be employed in any context where content recognition is required, such as in support of advertising, content control, web crawling, or any other context that may require its use.