Claims
- 1. A method for generating a virtual document for a target web page, the target web page being associated with a universal resource locator, the method comprising the steps of:
(a) locating a plurality of universal resource locators associated with web pages that cite the target web page; (b) downloading the web pages that cite the target web page or obtaining contents of the web pages; (c) traversing each web page or obtained content for each web page to extract extended anchortext for at least one hyperlink that links each web page to the target web page; and (d) creating a virtual document comprising the extracted extended anchortext of each web page.
- 2. A method for generating a virtual document according to claim 1, wherein a web index is used for locating the plurality of universal resource locators that cite the target web page.
- 3. A method for generating a virtual document according to claim 1, wherein a data cache stores the contents of the web pages.
- 4. A method for generating a virtual document according to claim 1, wherein the extracted extended anchortext comprises a predetermined number of words before and a predetermined number of words after the at least one hyperlink that links each web page to the target web page.
- 5. A method for generating a virtual document according to claim 4, wherein the predetermined number of words before the at least one hyperlink is 25 words and the predetermined number of words after the at least one hyperlink is 25 words.
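As an illustration only, the extraction and combination steps recited in claims 1, 4 and 5 might be sketched as follows. The function names `extended_anchortext` and `virtual_document` are hypothetical. The sketch assumes that a hyperlink cites the target only when its `href` matches the target URL exactly (no URL normalization), that the anchor's own text is kept together with the surrounding words, and that the citing pages have already been downloaded or read from a cache.

```python
from html.parser import HTMLParser


class _LinkTextParser(HTMLParser):
    """Collects page words in order and records where links to the target sit."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.words = []     # every text word, in document order
        self.spans = []     # (start, end) word indices of anchors citing the target
        self._start = None  # start index of the currently open matching anchor

    def handle_starttag(self, tag, attrs):
        # Assumption: exact string match on href; real pages need URL normalization.
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._start = len(self.words)

    def handle_endtag(self, tag):
        if tag == "a" and self._start is not None:
            self.spans.append((self._start, len(self.words)))
            self._start = None

    def handle_data(self, data):
        # Simplification: script and style text is not filtered out.
        self.words.extend(data.split())


def extended_anchortext(html, target_url, before=25, after=25):
    """Return one snippet per hyperlink to target_url: up to `before` words
    preceding the link, the anchor text, and up to `after` words following it."""
    parser = _LinkTextParser(target_url)
    parser.feed(html)
    parser.close()
    return [
        " ".join(parser.words[max(0, start - before):end + after])
        for start, end in parser.spans
    ]


def virtual_document(citing_pages, target_url, before=25, after=25):
    """Combine the extended anchortext of every citing page into one document."""
    snippets = []
    for html in citing_pages:
        snippets.extend(extended_anchortext(html, target_url, before, after))
    return "\n".join(snippets)
```

The default window of 25 words on each side mirrors claim 5; any other predetermined number can be passed in.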
- 6. A system for generating a virtual document for a target web page, the target web page being associated with a universal resource locator, the system comprising:
backlink locator for locating a plurality of universal resource locators associated with web pages that cite the target web page; web page downloader for downloading the web pages that cite the target web page or a data cache for obtaining contents of the web pages; extended anchortext extractor for traversing each web page or obtained content for each web page to extract extended anchortext for at least one hyperlink that links each web page to the target web page; and extended anchortext combiner for creating a virtual document comprising the extracted extended anchortext of each web page.
- 7. A system for generating a virtual document according to claim 6, wherein the extracted extended anchortext comprises a predetermined number of words before and a predetermined number of words after the at least one hyperlink that links each web page to the target web page.
- 8. A system for generating a virtual document according to claim 7, wherein the predetermined number of words before the at least one hyperlink is 25 words and the predetermined number of words after the at least one hyperlink is 25 words.
- 9. A method for determining whether a target web page is to be classified into a category of similar web pages, the method comprising the steps of:
(a) generating a corresponding virtual document for the target web page, the virtual document comprising extended anchortext extracted from each of a plurality of web pages that includes at least one hyperlink citing the target web page; (b) determining classification of the corresponding virtual document using a trained virtual document classifier; (c) generating a classification output for the target web page, the classification output being representative of whether the target web page is to be classified into the category of similar web pages on the basis of the classification determination of the corresponding virtual document.
- 10. A method for determining whether a target web page is to be classified into a category of similar web pages according to claim 9, wherein the step of generating a corresponding virtual document comprises the steps of:
locating a plurality of universal resource locators associated with web pages that cite the target web page; downloading the web pages that cite the target web page or obtaining contents of the web pages; traversing each web page or obtained content for each web page to extract extended anchortext for at least one hyperlink that links each web page to the target web page; and creating the corresponding virtual document comprising the extracted extended anchortext of each web page.
- 11. A method for determining whether a target web page is to be classified into a category of similar web pages according to claim 9, wherein the method further comprises a step of training the virtual document classifier.
- 12. A method for determining whether a target web page is to be classified into a category of similar web pages according to claim 11, wherein the step of training the virtual document classifier comprises the steps of:
inputting a set of labeled virtual documents into the virtual document classifier, a label associated with each labeled virtual document representing whether each associated virtual document is a member of a positive set of virtual documents or a member of a negative set of virtual documents; producing a prediction rule from the labeled set of virtual documents for determining a label of an unlabeled virtual document that is input into the virtual classifier during classification.
- 13. A system for determining whether a target web page is to be classified into a category of similar web pages, the system comprising:
a virtual document generator for generating a corresponding virtual document for the target web page, the virtual document comprising extended anchortext extracted from each of a plurality of web pages that includes at least one hyperlink citing the target web page; and a virtual document classifier for determining classification of the corresponding virtual document and for generating a classification output for the target web page, the classification output being representative of whether the target web page is to be classified into the category of similar web pages on the basis of the classification determination of the corresponding virtual document.
- 14. A system for determining whether a target web page is to be classified into a category of similar web pages according to claim 13, wherein to generate the corresponding virtual document for the target web page the virtual document generator:
locates a plurality of universal resource locators associated with web pages that cite the target web page; downloads the web pages that cite the target web page or obtains contents of the web pages; traverses each web page or obtained content for each web page to extract extended anchortext for at least one hyperlink that links each web page to the target web page; and creates the corresponding virtual document comprising the extracted extended anchortext of each web page.
- 15. A system for determining whether a target web page is to be classified into a category of similar web pages according to claim 13, wherein the virtual document classifier is trained.
- 16. A system for determining whether a target web page is to be classified into a category of similar web pages according to claim 15, wherein virtual document classifier training comprises the virtual document classifier:
inputting a set of labeled virtual documents into the virtual document classifier, a label associated with each labeled virtual document representing whether each associated virtual document is a member of a positive set of virtual documents or a member of a negative set of virtual documents; and producing a prediction rule from the labeled set of virtual documents for determining a label of an unlabeled virtual document that is input into the virtual classifier during classification.
- 17. A method for determining whether a target web page is to be classified into a category of similar web pages, the target web page being associated with a universal resource locator, the method comprising the steps of:
(a) generating a corresponding virtual document for the target web page, the virtual document comprising extended anchortext extracted from each of a plurality of web pages that includes at least one hyperlink citing the target web page; (b) determining classification of the corresponding virtual document using a trained virtual document classifier; (c) generating a classification output for the target web page, the classification output representative of whether the target web page is to be classified into the category of similar web pages on the basis of the classification determination of the corresponding virtual document; (d) downloading the target web page or obtaining contents of the target web page; (e) generating a classification output of the target web page utilizing a trained full-text classifier; and (f) combining the classification output of the virtual document classifier and the classification output of the full-text classifier to generate a combined classification output for the target web page, representing whether the target web page is to be classified into the category of similar web pages.
- 18. A method for determining whether a target web page is to be classified into a category of similar web pages according to claim 17, wherein a data cache stores the contents of the target web page.
- 19. A method for determining whether a target web page is to be classified into a category of similar web pages according to claim 17, wherein the step of generating a corresponding virtual document comprises the steps of:
locating a plurality of universal resource locators associated with web pages that cite the target web page; downloading the web pages that cite the target web page or obtaining contents of the web pages; traversing each web page or obtained content for each web page to extract extended anchortext for at least one hyperlink that links each web page to the target web page; and creating the corresponding virtual document comprising the extracted extended anchortext of each web page.
- 20. A method for determining whether a target web page is to be classified into a category of similar web pages according to claim 17, wherein the method further comprises a step of training the virtual document classifier.
- 21. A method for determining whether a target web page is to be classified into a category of similar web pages according to claim 20, wherein the step of training the virtual document classifier comprises the steps of:
inputting a set of labeled virtual documents into the virtual document classifier, a label associated with each labeled virtual document representing whether each associated virtual document is a member of a positive set of virtual documents or a member of a negative set of virtual documents; and producing a prediction rule from the labeled set of virtual documents for determining a label of an unlabeled virtual document that is input into the virtual classifier during classification.
- 22. A method for determining whether a target web page is to be classified into a category of similar web pages according to claim 17, wherein the method further comprises a step of training the full-text classifier.
- 23. A method for determining whether a target web page is to be classified into a category of similar web pages according to claim 22, wherein the step of training the full-text classifier comprises the steps of:
inputting a set of labeled web pages into the full-text classifier, a label associated with each labeled web page representing whether each associated web page is a member of a positive set of web pages or a member of a negative set of web pages; and producing a prediction rule from the labeled set of web pages for determining a label of an unlabeled web page that is input into the full-text classifier during classification.
- 24. A method for determining whether a target web page is to be classified into a category of similar web pages according to claim 17, wherein the classification output of the full-text classifier is S1 and the classification output of the virtual document classifier is S2 and the combined classification output is:
classifying the target web page as positive for membership in the category of similar web pages if S2 is greater than 0; classifying the target web page as negative for membership in the category of similar web pages if S2 is not greater than 0 and S2 is less than −1; classifying the target web page as positive for membership in the category of similar web pages if S2 is not less than −1 and S1 is greater than an absolute value of S2; and classifying the target web page as negative for membership in the category of similar web pages if S2 is not less than −1 and S1 is not greater than an absolute value of S2.
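The four conditions of claim 24 reduce to a three-way decision on S2 with S1 consulted only in the middle band. A minimal sketch, assuming S1 and S2 are signed real-valued classifier scores; the function name `combine` is hypothetical:

```python
def combine(s1: float, s2: float) -> bool:
    """Return True if the target web page is classified as positive.

    s1: output of the full-text classifier.
    s2: output of the virtual document classifier.
    """
    if s2 > 0:
        # The virtual document classifier alone says positive.
        return True
    if s2 < -1:
        # The virtual document classifier is strongly negative.
        return False
    # Here -1 <= s2 <= 0: the full-text score must outweigh |s2|.
    return s1 > abs(s2)
```

For example, `combine(0.8, -0.5)` is positive because S1 exceeds the absolute value of S2, while `combine(10.0, -1.5)` is negative regardless of S1.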
- 25. A system for determining whether a target web page is to be classified into a category of similar web pages, the target web page being associated with a universal resource locator, the system comprising:
a virtual document generator for generating a corresponding virtual document for the target web page, the virtual document comprising extended anchortext extracted from each of a plurality of web pages that includes at least one hyperlink citing the target web page; a virtual document classifier for determining classification of the corresponding virtual document and for generating a classification output for the target web page, the classification output representative of whether the target web page is to be classified into the category of similar web pages on the basis of the classification determination of the corresponding virtual document; a web page downloader for downloading the target web page or a data cache for obtaining contents of the target web page; a full-text classifier for generating a classification output of the target web page; a combiner for combining the classification output of the virtual document classifier and the classification output of the full-text classifier to generate a combined classification output for the target web page, representing whether the target web page is to be classified into the category of similar web pages.
- 26. A system for determining whether a target web page is to be classified into a category of similar web pages according to claim 25, wherein to generate the corresponding virtual document for the target web page the virtual document generator:
locates a plurality of universal resource locators associated with web pages that cite the target web page; downloads the web pages that cite the target web page or obtains contents of the web pages; traverses each web page or obtained content for each web page to extract extended anchortext for at least one hyperlink that links each web page to the target web page; and creates the corresponding virtual document comprising the extracted extended anchortext of each web page.
- 27. A system for determining whether a target web page is to be classified into a category of similar web pages according to claim 25, wherein the virtual document classifier is trained.
- 28. A system for determining whether a target web page is to be classified into a category of similar web pages according to claim 27, wherein virtual document classifier training comprises the virtual document classifier:
inputting a set of labeled virtual documents into the virtual document classifier, a label associated with each labeled virtual document representing whether each associated virtual document is a member of a positive set of virtual documents or a member of a negative set of virtual documents; and producing a prediction rule from the labeled set of virtual documents for determining a label of an unlabeled virtual document that is input into the virtual classifier during classification.
- 29. A system for determining whether a target web page is to be classified into a category of similar web pages according to claim 25, wherein the full-text classifier is trained.
- 30. A system for determining whether a target web page is to be classified into a category of similar web pages according to claim 29, wherein full-text classifier training comprises the full-text classifier:
inputting a set of labeled web pages into the full-text classifier, a label associated with each labeled web page representing whether each associated web page is a member of a positive set of web pages or a member of a negative set of web pages; and producing a prediction rule from the labeled set of web pages for determining a label of an unlabeled web page that is input into the full-text classifier during classification.
- 31. A system for determining whether a target web page is to be classified into a category of similar web pages according to claim 25, wherein the classification output of the full-text classifier is S1 and the classification output of the virtual document classifier is S2 and the combined classification output is:
classifying the target web page as positive for membership in the category of similar web pages if S2 is greater than 0; classifying the target web page as negative for membership in the category of similar web pages if S2 is not greater than 0 and S2 is less than −1; classifying the target web page as positive for membership in the category of similar web pages if S2 is not less than −1 and S1 is greater than an absolute value of S2; and classifying the target web page as negative for membership in the category of similar web pages if S2 is not less than −1 and S1 is not greater than an absolute value of S2.
- 32. A method for generating a description of a set of web pages in a collection comprising a plurality of web pages, the method comprising the steps of:
(a) defining a positive set of web pages in the collection and a negative set of web pages representing all web pages or a random set of web pages in the collection; (b) generating respective histograms for the positive set of web pages and the negative set of web pages, the generation of the respective histograms comprising: i) generating a virtual document for each target web page in the positive and negative sets, the virtual document comprising extended anchortext extracted from each of a plurality of web pages that includes at least one hyperlink citing each target web page in the positive and negative sets; ii) generating a document vector describing features in the virtual document for each target web page in the positive and negative sets; and iii) creating the respective histograms and updating the respective histograms based on the document vector of the virtual document for each target web page in the positive and negative sets; (c) applying a predetermined threshold to the respective histograms for the positive set of web pages and the negative set of web pages to eliminate a plurality of non-descriptive features that occur in less than a predetermined percentage of web pages in the positive and negative sets, to thereby produce a listing of possible descriptive features; (d) evaluating entropy for each possible descriptive feature in the listing of the possible descriptive features; and (e) sorting the listing of the possible descriptive features according to the evaluated entropy for each descriptive feature and selecting a predetermined number of highest-ranked descriptive features to describe the positive set of web pages.
- 33. A method for generating a description of a set of web pages according to claim 32, wherein the step of generating a virtual document for each target web page in the positive and negative sets comprises the following steps:
locating a plurality of universal resource locators associated with web pages that cite each target web page; downloading the web pages that cite each target web page or obtaining contents of the web pages; traversing each web page or obtained content for each web page to extract extended anchortext for at least one hyperlink that links each web page to each target web page; and creating the corresponding virtual document comprising the extracted extended anchortext of each web page.
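For illustration, steps (b) through (e) of claim 32 might be sketched as below. Several choices here are assumptions not fixed by the claims: features are lowercase whitespace-separated words, a feature is eliminated only when it falls below the threshold in both sets, and "entropy" is evaluated as expected entropy loss (information gain) between the positive and negative sets. The names `histogram` and `rank_features` are hypothetical, and both sets are assumed to be non-empty.

```python
import math
from collections import Counter


def _entropy(p):
    """Binary entropy in bits of a probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def histogram(virtual_docs):
    """Count, for each feature, how many virtual documents contain it."""
    counts = Counter()
    for doc in virtual_docs:
        counts.update(set(doc.lower().split()))  # document vector as a feature set
    return counts


def rank_features(positive_docs, negative_docs, min_fraction=0.1, top_k=5):
    """Return the top_k features ranked by expected entropy loss."""
    pos_hist, neg_hist = histogram(positive_docs), histogram(negative_docs)
    n_pos, n_neg = len(positive_docs), len(negative_docs)
    total = n_pos + n_neg
    prior = _entropy(n_pos / total)

    scored = []
    for feature in set(pos_hist) | set(neg_hist):
        p, n = pos_hist[feature], neg_hist[feature]
        # Threshold step (assumed reading): drop features rare in both sets.
        if p / n_pos < min_fraction and n / n_neg < min_fraction:
            continue
        present = p + n
        absent = total - present
        h_present = _entropy(p / present)
        h_absent = _entropy((n_pos - p) / absent) if absent else 0.0
        loss = prior - (present / total * h_present + absent / total * h_absent)
        scored.append((loss, feature))

    # Highest entropy loss first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [feature for _, feature in scored[:top_k]]
```

Note that expected entropy loss is symmetric: a word that appears only in the negative set scores as highly as one that appears only in the positive set, so a practical system would likely add a filter keeping only features more common in the positive set.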
- 34. A system for generating a description of a set of web pages in a collection comprising a plurality of web pages, the system comprising:
a means for defining a positive set of web pages in the collection and a negative set of web pages representing all web pages or a random set of web pages in the collection; a histogram generator for generating respective histograms for the positive set of web pages and the negative set of web pages, the histogram generator comprising: i) a virtual document generator for generating a virtual document for each target web page in the positive and negative sets, the virtual document comprising extended anchortext extracted from each of a plurality of web pages that includes at least one hyperlink citing each target web page in the positive and negative sets; ii) a document vector generator for generating a document vector describing features in the virtual document for each target web page in the positive and negative sets; and iii) a histogram updater for creating the respective histograms and updating the respective histograms based on the document vector of the virtual document for each target web page in the positive and negative sets; a threshold applicator for applying a predetermined threshold to the respective histograms for the positive set of web pages and the negative set of web pages to eliminate a plurality of non-descriptive features that occur in less than a predetermined percentage of web pages in the positive and negative sets, to thereby produce a listing of possible descriptive features; an entropy evaluator for evaluating entropy of each possible descriptive feature in the listing of the possible descriptive features; and a feature ranking tool for sorting the listing of the possible descriptive features according to the evaluated entropy for each descriptive feature and selecting a predetermined number of highest-ranked descriptive features to describe the positive set of web pages.
- 35. A system for generating a description of a set of web pages according to claim 34, wherein the virtual document generator comprises:
a backlink locator for locating a plurality of universal resource locators associated with web pages that cite each target web page; a web page downloader for downloading the web pages that cite each target web page or a data cache for obtaining contents of the web pages; an extended anchortext extractor for traversing each web page or obtained content for each web page to extract extended anchortext for at least one hyperlink that links each web page to each target web page; and an extended anchortext combiner for creating the corresponding virtual document comprising the extracted extended anchortext of each web page.
CROSS-REFERENCE
[0001] This application claims the benefit of U.S. Provisional Application No. 60/359,197, filed Feb. 22, 2002, which is incorporated herein in its entirety.
Provisional Applications (1)

| Number | Date | Country |
|---|---|---|
| 60359197 | Feb 2002 | US |