The World-Wide Web (or Web) provides numerous search engines for locating Web-based content. Search engines allow users to enter keywords, which can then be used to identify a list of documents such as Web pages. The Web pages are returned by the keyword search as a list of links that are generally sorted by the degree of match to the keywords. The list can also have paid links that are not as closely matched to the keywords, but are given a higher priority based on fees paid to the search engine company.
Search engines are often used by businesses to locate relevant products, such as Websites of providers of goods and/or services. However, the listing of the results by the match to a keyword does not identify whether the Web pages belong to a provider or merely contain a related word. Further, the search results are listed by Web pages. As numerous related Web pages may be in a single domain, e.g., constituting a Website, the results list can have a significant amount of redundancy. Accordingly, a business searcher can spend a significant amount of time accessing the links to identify which links correspond to useful Websites.
Certain exemplary embodiments are described in the following detailed description and in reference to the drawings, in which:
The Web provides a medium to allow individuals and businesses to find providers of numerous goods and services. Generally, search engines can be used to find content that is related to keywords submitted through a Web browser. A Web page, or results document, listing Web pages that are related to the keywords is typically returned. However, search engines do not necessarily make a determination regarding whether the Web pages they find are associated with providers or merely include the submitted key words. As used herein, the term “provider” should be understood to indicate a business that offers goods, services or information about goods and/or services to customers through a Website. Accordingly, the person performing the search may have to manually access each Web page to determine if the page belongs to a provider's Website.
Exemplary embodiments of the present invention can automatically determine whether references returned from a Web search represent providers or merely point to other content. Exemplary techniques use the results from a search that has been performed on the Web by a search engine or a supplier catalog, e.g., a results document containing links to Web pages matching keywords. The Web page links returned by the search engine can be automatically accessed to download the source code from the target Web pages. The source code for these Web pages can then be analyzed by searching for keywords and calculating a probabilistic value for each Web page that classifies the Web page as being associated with a provider. Generally, this association means that the provider owns the Web page, but the provider may merely have a presence on the Web page.
The client system 102 can also include other units operatively coupled to the processor 112 through the bus 113. These units can include tangible, machine-readable storage media, such as a storage system 122 for the long term storage of operating programs and data, such as the programs and data used in embodiments of the present techniques. Further, the client system 102 can have one or more other types of tangible, machine-readable storage media, such as a memory 124, for example, which may comprise read-only memory (ROM) and/or random access memory (RAM). In exemplary embodiments, the client system 102 can also include a network interface adapter 126, for connecting the client system 102 to a network, for example, a local area network (LAN 128), a wide-area network (WAN), or another network configuration. The LAN 128 can include routers, switches, modems, or any other kind of interface device used for interconnection.
Through the LAN 128, the client system 102 can connect to a business server 130. The business server 130 can have a storage array 132 for storing enterprise data, buffering communications, and storing operating programs for the business server 130. The business server 130 can also have associated printers 134, scanners, copiers and the like. The business server 130 can access the Web 110 through a connected router/firewall 136, providing the client system 102 with Web access. The business network discussed above should not be considered limiting. Moreover, those of ordinary skill in the art will appreciate that business networks can be far more complex and can include numerous business servers 130, printers 134, routers 136, and client systems 102, among other units. In other embodiments, the client system 102 can be directly connected to the Web 110 through the network interface adapter 126, or can be connected through a router or firewall 136. Any system that allows the client system 102 to access the Web 110 should be considered to be within the scope of the present techniques.
Through the router/firewall 136, the client system 102 can access a search engine 104 connected to the Web 110. In embodiments of the present invention, the search engine 104 can include generic search engines, such as Altavista.com, Google.com, Yahoo.com, or the like. Further, the search engine 104 can be a business specific catalog site, such as Thompson.net, among others. The client system 102 can also access providers 106-108 through the Web 110. The providers 106-108 can have single Web pages, or as shown for the third provider 108, can have multiple subpages 138-142. The subpages 138-142 can provide information or links, such as the first subpage 138, or can include forms to be filled out by the user, as shown for the second and third subpages 140 and 142.
Web browsers that can be used in embodiments include such products as: Internet Explorer, available from Microsoft; Firefox, available from Mozilla; Chrome, available from Google; Safari, available from Apple; or any number of other Web browsers. The Web browsers and, thus, embodiments of the present invention, can be implemented on any number of computing platforms, including the Macintosh operating system from Apple, the Windows operating system from Microsoft, or Linux based computing platforms, among others.
At block 204, the results document is analyzed to identify links to Web pages. Moreover, source code of the returned results document can be analyzed to identify and store the links to each of the Web pages identified by the search. At block 206, Web pages corresponding to the stored links from the results documents are accessed. For example, the links can be used in command strings, such as HTTP GET commands, or other command strings, to access each of the result pages and obtain the source code of the target page. The source code can then be analyzed to identify indicators that show the likelihood that the page belongs to a provider. The analysis can be performed, for example, by counting the number of indicators present in the source code.
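The link identification described for block 204 can be sketched as follows. This is a minimal illustration using Python's standard html.parser module, not the implementation of any embodiment; the names LinkCollector and extract_links are hypothetical and do not appear in the description.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of anchor tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(results_html):
    """Return the links found in the source code of a results document."""
    collector = LinkCollector()
    collector.feed(results_html)
    return collector.links
```

The links are returned in document order, so the original ranking of the results document is preserved for later comparison.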
Indicators that the Web page may be associated with a provider can include, for example, keywords that a business Website is likely to use, such as toll-free numbers, requests for credit card information, requests for payment information, requests for contact information, legal notices, the presence of business terminology, or phrases such as “company information”, “jobs”, “career”, or any combinations thereof. Further, indicators can include HTML tags, such as the “FORM” tag that invites users to supply information such as contact information or the like. The indicators can also comprise a combination of keywords and structural information, such as the keywords “credit card” or “Visa” within the structure of HTML tags such as <form> and <input type=“radio”>. Indicators can be derived in a number of ways, such as analysis of known service engagement documents, and can be weighted by how strongly they indicate a provider.
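One way such indicators might be detected is with regular expressions over the page source. The sketch below is illustrative only: the indicator names, the patterns, and the function find_indicators are assumptions made for this example, and a production detector would likely need a real HTML parser rather than pattern matching.

```python
import re

# Hypothetical indicator patterns; names and expressions are illustrative.
INDICATOR_PATTERNS = {
    # Structural indicator: any form element.
    "form_tag": re.compile(r"<form\b", re.IGNORECASE),
    # Keyword indicator: a North American toll-free number.
    "toll_free_number": re.compile(r"\b8(?:00|88|77|66)[-. ]\d{3}[-. ]\d{4}\b"),
    # Keyword indicator: a phrase typical of business sites.
    "company_information": re.compile(r"company information", re.IGNORECASE),
    # Combined indicator: a payment keyword inside a form element.
    "payment_in_form": re.compile(
        r"<form\b.*?(credit card|visa).*?</form>", re.IGNORECASE | re.DOTALL
    ),
}


def find_indicators(source):
    """Return the set of indicator names present in the page source."""
    return {
        name for name, pattern in INDICATOR_PATTERNS.items()
        if pattern.search(source)
    }
```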
A Web page may be deemed to belong to a provider if testing indicates that the Web page has a certain number of indicators. If results from a Web page do not contain a sufficient number of indicators that the Web page belongs to a provider, links originating from that Web page that are within the same domain, e.g., http://*.hp.com, can be followed and evaluated. The subsequent pages (or subpages) are then also tested to determine whether they have enough indicators to belong to a provider.
At block 208, a numerical value that indicates the probability that each Web page is associated with a provider is computed. The probability can be calculated from an indicator vector that is created for each Web page listing the indicators present on that Web page, as discussed in further detail herein. The presence of each indicator can be multiplied by a previously defined weight factor for that indicator. The products for all of the indicators can be summed and divided by the number of indicators to provide the value for the probability. Further, a combined indicator vector can be used to profile an entire Website, since some providers scatter their information for the indicators across different pages and forms, such as a first page or form that requests identification of a desired service and a second page or form requesting payment information.
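The calculation described for block 208 can be sketched as a weighted sum divided by the number of indicators. The indicator names and weights below are invented for illustration and are not taken from the description; provider_score and combine_pages are likewise hypothetical names.

```python
# Hypothetical indicator base: names and weights are illustrative only.
INDICATOR_BASE = {
    "form_tag": 1.0,
    "toll_free_number": 0.6,
    "payment_request": 1.0,
    "directory_table": -1.0,
}


def provider_score(present, indicator_base=INDICATOR_BASE):
    """Sum the weights of the indicators that are present, then divide
    by the total number of indicators in the base."""
    weighted_sum = sum(
        weight for name, weight in indicator_base.items() if name in present
    )
    return weighted_sum / len(indicator_base)


def combine_pages(indicator_sets):
    """Merge per-page indicator sets into one profile for a Website."""
    combined = set()
    for indicators in indicator_sets:
        combined |= indicators
    return combined
```

For example, a Website whose first page contains only a form and whose second page contains only a payment request would be scored on the union of both indicators, rather than on either page alone.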
After the probability values are calculated for each Web page, probabilities for each page can be displayed, as shown at block 210. Moreover, the list of links from the results document can be reordered and displayed according to which link has the highest probability of belonging to a provider. In an exemplary embodiment, Web pages that are below a user-selected probability can be dropped from the new listing of links from the results document. Previously low-ranked Web pages can be placed higher in the new results list if the analysis indicates a higher probability that the Web page belongs to a provider. In other embodiments, the original results document may be displayed, with the probabilities displayed in proximity to the links to the Web pages.
In an exemplary embodiment, a browser 302, generally located on the client computer 102 (
The results document 306 is processed by a link dereferencer 308, which scans source code of the results document 306 for links. The link dereferencer 308 can perform a requested operation, such as an HTTP GET request, to obtain the source code of each Web page 310 that is referenced by a link in the results document 306. Accessing the source code of the Web pages 310 referred to by the link can be termed “dereferencing” the link. Output from the link dereferencer 308 can comprise source code for the set of Web pages 310, each returned from one link.
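Dereferencing a link can be sketched with Python's standard urllib.request module, which issues a GET request when no request body is supplied. The function names dereference and dereference_all, the 10-second timeout, and the UTF-8 fallback are assumptions made for this example.

```python
from urllib.request import urlopen


def dereference(link, timeout=10):
    """Fetch the source code of the page a link refers to."""
    with urlopen(link, timeout=timeout) as response:
        # Fall back to UTF-8 when the server declares no charset (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def dereference_all(links):
    """Map each reachable link to its page source, skipping failures."""
    sources = {}
    for link in links:
        try:
            sources[link] = dereference(link)
        except (OSError, ValueError):
            # URLError subclasses OSError; malformed links raise ValueError.
            continue
    return sources
```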
In an exemplary embodiment, a user can restrict the link dereferencer 308 to obtaining source code for Web pages 310 located in a search results section of the results document 306. In this manner, the link dereferencer 308 can be prevented from obtaining source code for Web pages 310 representing advertising, sponsored links, or other material.
The source code for the Web pages 310 is processed by an indicator extractor 312. The indicator extractor 312 is a software component that is adapted to search the source code of each Web page 310 for the presence of indicators and to collect the indicators into a vector P[]. Moreover, the vector P[] can comprise all of the indicators found on the Web pages 310. The indicator extractor 312 can perform this function by identifying a list of words present in the source code of each Web page 310, then comparing the words to a list of words in an indicator base 314. The indicator base 314 is a data structure of a weighted vector of indicators that, if present in the source code of the Web pages 310, can indicate that the Web pages 310 are associated with a provider. The data structures in the indicator base 314 can be represented as IB[i,w], wherein i represents an indicator description and w represents the weight of the indicator. The indicator base 314 can be readily modified to change the results of the evaluation.
The vector P[] of indicators is submitted to an indicator evaluator 316. The indicator evaluator 316 is a software component that is adapted to compute a decision about whether one or more of the Web pages 310 have sufficient weighted indicators, based on the vector P[], to be classified as being associated with a provider. The indicator evaluator 316 can perform a further dereferencing cycle to follow links contained in the Web page 310 being evaluated, as indicated by an arrow 318. For example, if one or more of the evaluated Web pages 310 do not have sufficient indicators to make a determination, the links on the Web page 310 that are within the same URL domain can be tested. The dereferencing recursion can be halted after the content of the URL domain can be sufficiently classified as likely to be associated with a provider or not. Alternatively, the recursion can be halted after a predetermined number of dereferencing cycles or after all of the Web pages in a domain, e.g., an entire Website, have been evaluated.
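The further dereferencing cycle can be sketched as a bounded crawl of same-domain links. This sketch makes several assumptions: it uses an iterative queue rather than literal recursion, it treats "same domain" as an exact host match (stricter than a pattern such as http://*.hp.com), and the fetch and score callables, the threshold defaults, and the max_pages limit are hypothetical parameters supplied by the caller.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def evaluate_site(start_url, fetch, score, upper=0.6, lower=0.1, max_pages=10):
    """Crawl same-host links until the combined score is decisive.

    fetch(url) -> (set of indicator names, list of hrefs)  [hypothetical]
    score(set of indicator names) -> float                  [hypothetical]
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    combined = set()
    value = 0.0
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        indicators, hrefs = fetch(url)
        visited += 1
        combined |= indicators
        value = score(combined)
        # Halt once the site can be classified either way.
        if value > upper or value < lower:
            break
        for href in hrefs:
            target = urljoin(url, href)
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return value
```

Passing fetch and score as arguments keeps the crawl logic independent of how pages are downloaded and how indicators are weighted.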
The indicator evaluator 316 generates a vector 320 of probabilistic values p for each link I, SP[I,p], which can indicate the likelihood of the link pointing to a Web page 310 that is associated with a provider. A value of 1.0 can indicate a high likelihood that one or more of the Web pages 310 is associated with a provider, while a value of 0.0 can indicate a high likelihood that none of the Web pages 310 is associated with a provider. Accordingly, values between 0.0 and 1.0 can indicate a proportional likelihood that at least one of the Web pages 310 is associated with a provider. Further, if the indicator evaluator 316 has recursively accessed other pages linked to the Web page 310 being evaluated, the vector 320 can represent the probability that an entire Website is associated with a provider.
The vector 320 can be directly displayed or can be provided to a display unit 322. The display unit 322 can display a new results document 324 showing the results ordered by the probabilistic values, for example, from highest to lowest. The new results document 324 can omit any results that have a probabilistic value lower than a user-defined limit, for example, less than about 0.1, 0.2, 0.3, 0.5, or any other value that appropriately limits the results. Further, the new results document 324 can have items corresponding to entire Websites, for example, when the indicator evaluator 316 has recursively accessed several Web pages 310 from a single domain. The display unit 322 is not limited to displaying results as an ordered list. For example, the display unit 322 can display the initial results document 306 with the probabilistic value for each of the Web pages 310 displayed in proximity to the link for that page.
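The reordering and filtering performed for the new results document 324 can be sketched as follows. The function name rerank and the representation of the vector 320 as a list of (link, probability) pairs are assumptions made for this example.

```python
def rerank(scored_links, min_probability=0.0):
    """Drop links below the user-defined limit and sort the rest from
    highest to lowest probabilistic value.

    scored_links: list of (link, probability) pairs (assumed layout).
    """
    kept = [(link, p) for link, p in scored_links if p >= min_probability]
    # sorted() is stable, so ties keep their original search-engine order.
    return sorted(kept, key=lambda item: item[1], reverse=True)
```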
The various software components discussed herein can be stored onto the tangible, machine-readable medium 400 as indicated in
An exemplary embodiment of the present invention was tested to determine the efficacy of the techniques. In this embodiment, the presence of FORM pages and the accompanying requests for client information were used as indicators that Web pages could belong to providers. Specifically, the indicator base (IB[i,w]) used for the test is shown in columns 2 (i) and 3 (w) of Table 1.
The information in Table 1 was assembled by examining the Web pages from a number of providers. It was discovered that choosing indicators where the site asks for information from the client was an effective way of narrowing down sites that might be owned by providers. The weights for each dimension (w), as shown in column 3, were then established. For example, many Web pages have forms for searching and many businesses have toll-free numbers, so these are not, by themselves, clear indicators of a provider. Accordingly, the weight of these indicators was reduced to 0.6 in this example.
As can be seen from the weighting factor (w) used in row 16, the weighting factors are not limited to positive values. Thus, a negative weighting factor can be used to account for the occurrence of items that militate against the Web page belonging to a provider. If there is a particularly important negative characteristic, such as a long table of similar entries that is likely to be found in a directory of services rather than on a provider's own site, then a large negative weight can be assigned to reject such Web pages.
An example Web page was analyzed using the information in Table 1. A comparison of the source code for the Web page with the indicators shown in column 2 resulted in the true/false indication shown in column 4, which is 1 if the indicator was present and 0 if the indicator was not present. Many variants are possible, for example, the number of times an indicator appears in a Web page could be used in place of the true/false indication.
The true/false indication in column 4 was multiplied by the weight in column 3, resulting in the values shown in column 5. These values were summed, providing the value of 11.20 in row 17, and normalized by the number of dimensions, providing the value of 0.7 in row 18. An upper threshold may be set to indicate the association of the Web page with a provider, for example, 0.6 in the present example. As the normalized value, 0.7, is above this threshold, the Web page is likely to be associated with a provider.
A lower threshold may be set to indicate if a Website is likely not associated with a provider, for example, 0.1. If the normalized sum is between those values, then the indicator evaluator may keep crawling that domain to get a clearer indication, e.g., above the higher threshold or below the lower threshold. The weights and thresholds could be set by analyzing the sites of desired types of known providers and known non-providers. More complex algorithms may also be defined.
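The two-threshold decision can be sketched as below, using the example thresholds of 0.6 and 0.1 and the sum of 11.20 over 16 dimensions reported for the tested Web page. The function name classify and the returned labels are hypothetical.

```python
def classify(normalized_score, upper=0.6, lower=0.1):
    """Classify a normalized score against the two thresholds."""
    if normalized_score > upper:
        return "provider"
    if normalized_score < lower:
        return "not provider"
    # Between the thresholds: the evaluator may keep crawling the domain.
    return "undetermined"


# Worked example from the text: a sum of 11.20 over 16 dimensions.
example_score = 11.20 / 16
example_label = classify(example_score)
```

Here example_score is approximately 0.7, which exceeds the upper threshold, so example_label is "provider".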