The proliferation of the World Wide Web has made enormous amounts of information available through the Internet, and numerous search engines are available to help users sort through the information. For example, a user will choose a search service and then enter a query. The search service accepts the query and returns a result list of documents or links that satisfy the query. It is desirable that the list of results be ordered such that documents and/or links that are most relevant to the user's query appear first, and search engines typically include one or more algorithms to provide some sort of ranking of the search results for the user.
Ranking algorithms may be classified as query-dependent (also called dynamic) or query-independent (also called static). Query-dependent ranking algorithms use the terms in the query while query-independent algorithms do not. Query-independent ranking algorithms assign a quality score to each document on the web. Therefore, query-independent ranking algorithms can be run ahead of time and need not be rerun whenever a user submits a query.
Ranking algorithms may also be broadly classified into content-based, usage-based, and link-based ranking algorithms. Content-based ranking algorithms use the words in a document to rank the document among other documents. For example, a query-dependent content-based ranking algorithm could assign higher scores to documents that contain the query terms in the beginning of a document or in a prominent font. Usage-based ranking algorithms assign a score based on an estimate of how often the documents are viewed, for example, by examining web proxy logs or monitoring click-through on the results pages of the search engine. Link-based ranking algorithms use the hyperlinks between web pages to rank web pages. For example, a static link-based ranking algorithm could assign a score to each web page that is proportional to the number of links pointing to that page, based on the notion that links pointing to a page are actually an endorsement of the page.
PageRank® is a well-known and commonly used query-independent link-based ranking algorithm. Assume that the set of known web pages and the links between them induces a graph with vertex set V, where each vertex corresponds to a web page, and edge set E, where each directed edge (u,v) corresponds to a hyperlink from page u to page v. Let O(u) denote the outdegree of vertex u, i.e., the number of hyperlinks embedded in web page u, and let d be a number between 0 and 1, e.g., 0.15. The PageRank vector R is a vector whose values R(v), normalized to have a total sum of 1, satisfy the following equation:

R(v) = d/|V| + (1−d)·Σ(u,v)∈E R(u)/O(u)

Note that a page having an outdegree of 0 must be handled as a special case.
The PageRank formula is often explained as follows. Consider a web surfer who is performing a random walk on the web. At every step along the walk, the surfer moves from one web page to another, using the following algorithm. With some probability d, the surfer selects a web page uniformly at random and jumps to it; otherwise, the surfer selects one of the outgoing hyperlinks in the current page uniformly at random and follows it. Because of this metaphor, the number d is sometimes called the “jump probability,” namely the probability that the surfer will jump to a completely random page. If the web surfer jumps with probability d and there are |V| web pages, the probability of jumping to a particular page is d/|V|. Since any page can be reached by jumping, every page is guaranteed a score of at least d/|V|. The PageRank of a particular web page is then the fraction of time that the random surfer will spend at that page.
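For illustration, the random-surfer computation can be sketched in a few lines of Python. This is a minimal sketch, not the production algorithm; the toy graph, the jump probability d=0.15, and the fixed iteration count are assumptions for the example, and every page in the toy graph has at least one outgoing link, so the zero-outdegree special case does not arise.

```python
d = 0.15  # jump probability: chance the surfer jumps to a random page

# Toy web graph: page -> list of pages it links to (illustrative only).
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

pages = list(links)
n = len(pages)
rank = {p: 1.0 / n for p in pages}  # start uniform

for _ in range(50):  # power iteration until approximate convergence
    new_rank = {p: d / n for p in pages}  # every page gets at least d/|V|
    for u, outlinks in links.items():
        for v in outlinks:
            # the surfer follows one of u's links uniformly at random
            new_rank[v] += (1 - d) * rank[u] / len(outlinks)
    rank = new_rank

print(rank)  # scores sum to 1; a higher score means more of the surfer's time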
PageRank scores may be used to rank query results. A search engine employing PageRank will rank pages with high PageRank scores higher than those with low scores, all else being equal. Since most users of search engines tend to examine only the first few results, operators of commercial web sites would certainly prefer that links to their sites appear early in the result listing, that is, that their web pages receive high PageRank scores. Thus, commercial web site operators have a clear incentive to try to artificially increase the PageRank scores of the pages on their web sites.
One way to increase the PageRank score of a web page v is by having many other pages link to it. If all of the pages that link to web page v have low PageRank scores, each individual page would appear to contribute very little to the PageRank score of page v. However, since every linking page is guaranteed to have a minimum PageRank score of d/|V|, links from many such low quality pages can still inflate the PageRank score.
In practice, the vulnerability of PageRank to artificial inflation of scores is being exploited by web sites that contain a very large set of pages whose only purpose is to “endorse” a main home page. Typically, these endorsing pages contain a link to the page that is to be endorsed, and one or more links to other endorsing pages. Once a web crawler has stumbled across any of the endorsing pages, it continues to download more endorsing pages since the endorsing pages link to other endorsing pages, thereby accumulating a large number of endorsing pages. This large number of endorsing pages, all of them endorsing a single page, artificially inflates the PageRank score of the page that is being endorsed.
This problem was addressed and partially solved in U.S. Patent Publication No. 2005/0060297 entitled Systems And Methods For Ranking Documents Based Upon Structurally Interrelated Information, where the PageRank technique was modified to provide resistance to link spam by giving more weight to hosts/domains/servers that contain many web pages.
However, it remains desirable to find improved query-independent link-based ranking techniques. In particular, such techniques should significantly reduce the effects of artificially created endorsement links, and reduce the incentive for creating such links for the purpose of inflating PageRank scores.
The present disclosure describes a method for static ranking of search results. Search engines are typically configured such that search results having a higher ranking or score are listed first. This disclosure is directed to a modified scoring technique whereby the score is biased toward web pages that are linked to by blogs. The notion is that blogs still represent human-created content, and that links from blogs are genuine endorsements of the target web page.
As a preliminary matter, web pages must be classified as either blog or non-blog. This may be accomplished, for example, by web crawling using predefined criteria that have been shown to evidence blogs, such as (i) whether a page is hosted in a known blog hosting DNS domain; (ii) features from the non-HTML-markup words and phrases contained in the page; (iii) the targets of outgoing links in the page; and (iv) whether the string “blog” occurs in the URL.
After the web pages have been classified, a query is submitted to a search engine, and results are returned in an order determined by the ranking or scoring mechanism. In this disclosure, the scoring mechanism is modified to exhibit a bias toward web pages that are linked to by blogs. Specifically, the well-known PageRank® scoring mechanism is modified such that its reset vector incorporates a bias toward web pages that are linked to by blogs.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The present disclosure describes a method for ranking search results by incorporating a bias toward data obtained from blogs.
The conventional PageRank® algorithm takes on the audacious task of condensing every page on the Web into a single number, its PageRank score. Thus, PageRank is a global ranking of all Web pages, regardless of their content, based solely on their location in the Web's graph structure. Using the conventional PageRank technique, search results are ordered so that more important Web pages are given preference. The intuition behind PageRank is that it uses information which is external to the Web pages themselves—their backlinks—which provide a kind of peer review. Furthermore, by recursive definition, backlinks from “important” pages are considered more significant than backlinks from average pages.
It is also known that personalized PageRank scores can create a view of the Web from a particular perspective, e.g., by taking a user's bookmarks and inflating the PageRank scores of the bookmarked pages. However, personalized PageRank does not explicitly address the problem of link spamming, because there is still a minimum score associated with each link-spam Web page. Accordingly, a link spammer can still create (automatically, if desired) a multitude of Web pages on a single Web server, each having its own minimum PageRank score, that artificially inflate the score of a target Web page by endorsing each other and the target page.
One of ordinary skill in the art can appreciate that the techniques disclosed herein can be implemented in connection with any computer or other client or server device, which can be deployed as part of a computer network, or in a distributed computing environment. In this regard, the present disclosure pertains to any computer system or environment having any number of memory or storage units, and any number of applications and processes occurring across any number of storage units or volumes, which may be used in connection with processes for ranking documents in accordance with the present disclosure. The present disclosure may apply to an environment with server computers and client computers deployed in a network environment or distributed computing environment, having remote or local storage. The present disclosure may also be applied to standalone computing devices having programming language functionality, and interpretation and execution capabilities for generating, receiving and transmitting information in connection with remote or local services. Downloading and analyzing Web pages is particularly relevant to those computing devices operating in a network or distributed computing environment, and thus the ranking algorithms and techniques in accordance with the present disclosure can be applied with great efficacy in those environments.
Distributed computing provides sharing of computer resources and services by exchange between computing devices and systems. These resources and services include the exchange of information, cache storage, and disk storage for files. Distributed computing takes advantage of network connectivity, allowing clients to leverage their collective power to benefit the entire enterprise. In this regard, a variety of devices may have applications, objects or resources that may implicate the ranking techniques and processes of the disclosure.
In a network environment in which the communications network/bus 14 is the Internet, for example, the servers 10a, 10b, etc. can be Web servers with which the clients 110a, 110b, 110c, 110d, 110e, etc. communicate via any of a number of known protocols such as HTTP. Servers 10a, 10b, etc. may also serve as clients 110a, 110b, 110c, 110d, 110e, etc., as may be characteristic of a distributed computing environment.
Communications may be wired or wireless, where appropriate. Client devices 110a, 110b, 110c, 110d, 110e, etc. may or may not communicate via communications network/bus 14, and may have independent communications associated therewith. For example, in the case of a TV or VCR, there may or may not be a networked aspect to the control thereof. Each client computer 110a, 110b, 110c, 110d, 110e, etc. and server computer 10a, 10b, etc. may be equipped with various application program modules and with connections or access to various types of storage elements or objects, across which files or data streams may be stored or to which portion(s) of files or data streams may be downloaded, transmitted or migrated. Any one or more of computers 10a, 10b, 110a, 110b, etc. may be responsible for the maintenance and updating of a database 20 or other storage element, such as a database or memory 20 for storing data processed according to techniques of the present disclosure. Thus, the present disclosure can be utilized in a computer network environment having client computers 110a, 110b, etc. that can access and interact with a computer network/bus 14 and server computers 10a, 10b, etc. that may interact with client computers 110a, 110b, etc. and other like devices, and databases 20.
Although not required, the techniques of this disclosure can be implemented via an operating system, for use by a developer of services for a device or object, and/or included within application software that operates in connection with the ranking techniques of the disclosure. Software may be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers, such as client workstations, servers or other devices. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments. Moreover, those skilled in the art will appreciate that the techniques of this disclosure may be practiced with other computer system configurations and protocols. Other well known computing systems, environments, and/or configurations that may be suitable for use with the disclosure include, but are not limited to, personal computers (PCs), automated teller machines, server computers, hand-held or laptop devices, multi-processor systems, microprocessor-based systems, programmable consumer electronics, network PCs, appliances, lights, environmental control elements, minicomputers, mainframe computers and the like. The techniques of this disclosure may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network/bus or other data transmission medium. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices, and client nodes may in turn behave as server nodes.
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120.
The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media.
The drives and their associated computer storage media discussed above provide storage of computer readable instructions, data structures, program modules and other data for the computer 110.
The computer 110 may operate in a networked or distributed environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated. The logical connections include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks.
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device.
Various distributed computing frameworks have been and are being developed in light of the convergence of personal computing and the Internet. Individuals and business users alike are provided with a seamlessly interoperable and Web-enabled interface for applications and computing devices, making computing activities increasingly Web browser or network-oriented.
For example, MICROSOFT®'s managed code platform, i.e., .NET, includes servers, building-block services, such as Web-based data storage and downloadable device software. Generally speaking, the .NET platform provides (1) the ability to make the entire range of computing devices work together and to have user information automatically updated and synchronized on all of them, (2) increased interactive capability for Web pages, enabled by greater use of XML rather than HTML, (3) online services that feature customized access and delivery of products and services to the user from a central starting point for the management of various applications, such as e-mail, for example, or software, such as Office .NET, (4) centralized data storage, which increases efficiency and ease of access to information, as well as synchronization of information among users and devices, (5) the ability to integrate various communications media, such as e-mail, faxes, and telephones, (6) for developers, the ability to create reusable modules, thereby increasing productivity and reducing the number of programming errors and (7) many other cross-platform and language integration features as well.
While some exemplary embodiments herein are described in connection with software residing on a computing device, one or more portions of the invention may also be implemented via an operating system, application programming interface (API) or a “middle man” object, a control object, hardware, firmware, intermediate language instructions or objects, etc., such that the methods may be included in, supported in or accessed via all of the languages and services enabled by managed code, such as .NET code, and in other distributed computing frameworks as well.
In order to utilize the modified ranking technique that is biased toward blogs, a method must first be used to identify web pages as blogs. This can be done by enabling a web crawler to perform a series of focused or general web crawls. For example, certain features have been identified as useful for classifying a web page, and these features could be used as part of a web crawling strategy. These features are generally organized into four categories: (i) whether a page is hosted in a known blog hosting DNS domain; (ii) features from the non-HTML-markup words and phrases contained in the page; (iii) the targets of outgoing links in the page; and (iv) whether the string “blog” occurs in the URL. In addition, human judgments of results from the web crawl can be used as input to the web crawling strategy to improve the probability that web pages will be correctly identified as blogs. A heuristic combining these feature categories is sketched below.
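By way of illustration only, the four feature categories could be combined in a simple heuristic classifier such as the following Python sketch. The domain list, phrase list, weights, and threshold are hypothetical placeholders, not values taken from this disclosure; a practical classifier would be trained on labeled crawl data and human judgments.

```python
from urllib.parse import urlparse

# Hypothetical feature lists; a real classifier would learn these from
# labeled crawl data and human judgments.
BLOG_HOSTING_DOMAINS = {"blogspot.com", "livejournal.com", "typepad.com"}
BLOG_PHRASES = {"permalink", "trackback", "comments", "archives", "posted by"}

def looks_like_blog(url: str, text: str, outlink_urls: list[str]) -> bool:
    """Heuristic blog classifier using the four feature categories."""
    host = urlparse(url).netloc.lower()

    # (i) hosted in a known blog hosting DNS domain
    hosted = any(host == d or host.endswith("." + d) for d in BLOG_HOSTING_DOMAINS)

    # (ii) non-markup words and phrases contained in the page
    words = text.lower()
    phrase_hits = sum(1 for p in BLOG_PHRASES if p in words)

    # (iii) targets of outgoing links (e.g., links into known blog hosts)
    blog_outlinks = sum(
        1 for u in outlink_urls
        if any(urlparse(u).netloc.lower().endswith(d) for d in BLOG_HOSTING_DOMAINS)
    )

    # (iv) the string "blog" occurs in the URL
    url_hit = "blog" in url.lower()

    # Hypothetical scoring: weights and threshold chosen for illustration.
    score = 2 * hosted + phrase_hits + blog_outlinks + 2 * url_hit
    return score >= 3
```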
As described briefly above, the conventional PageRank algorithm assigns a numerical weight to each element in a hyperlinked set of documents, such as the World Wide Web. Given the adjacency matrix W of a web graph on n pages, in which wij=1 if there exists a link from page i to page j in the web, the algorithm first obtains a stochastic matrix M by dividing each row of matrix W by its row sum, treating pages with zero outdegree as a special case. The algorithm then defines the PageRank Markov chain with transition matrix:
P=(1−ε)M+εU,
where U is the uniform transition matrix with uij=1/n for all i, j, and ε is a fixed constant, usually between 0.1 and 0.2. (By thinking of U as the matrix in which each row is u=(1/n, . . . , 1/n), we can say that u is the reset vector for this version of PageRank.) The PageRank of a particular web page is its stationary probability in P. One popular interpretation of this envisions a random surfer who walks across the web, usually by choosing a random outgoing link from her current page at each step, but sometimes (with probability ε) selecting a page uniformly at random from the entire web. In this case, the PageRank of a page is the probability that the random surfer will be at that page at any time step.
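A minimal sketch of this matrix formulation follows, assuming a toy adjacency matrix, ε=0.15, and the common convention that a dangling (zero-outdegree) page transitions according to the reset vector:

```python
import numpy as np

def pagerank_stationary(W: np.ndarray, reset: np.ndarray, eps: float = 0.15,
                        iters: int = 100) -> np.ndarray:
    """Stationary distribution of P = (1 - eps) * M + eps * U,
    where each row of U is the reset vector."""
    n = W.shape[0]
    row_sums = W.sum(axis=1)
    M = np.zeros_like(W, dtype=float)
    for i in range(n):
        if row_sums[i] > 0:
            M[i] = W[i] / row_sums[i]  # divide each row by its row sum
        else:
            M[i] = reset  # one common convention: a dangling page resets
    P = (1 - eps) * M + eps * np.tile(reset, (n, 1))
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):  # power iteration
        pi = pi @ P
    return pi

# Toy 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
W = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
u = np.full(3, 1.0 / 3)  # uniform reset vector
print(pagerank_stationary(W, u))
```

With the uniform reset vector u this reproduces conventional PageRank; the same function accepts any non-uniform reset vector, which is the hook the modified technique described below relies on.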
Referring now to a simplified example, consider a web graph of six pages A, B, C, D, E and F, in which pages A, B and F each link to page E, pages C and D each link to page F, and page E links to page C.
Initially, all pages have the same probability of being visited by the user, namely ⅙ (since there are six pages). However, for the first ranking iteration, each page passes its probability along its outgoing link, and the probabilities arriving at each page are summed. Thus, since page E has three links pointing to it from pages A, B and F, those probabilities are added, resulting in a probability of ½ for page E. Since page F has two links pointing to it from pages C and D, those probabilities are added, resulting in a probability of ⅓ for page F. Since page C has one link pointing to it from page E, that probability is passed on, resulting in a probability of ⅙ for page C. On the next iteration, the probability of ½ is passed along from page E to page C; the probability of ⅙ is passed along from page C to page F; and the probability of ⅓ is passed along from page F to page E. Since the links only point to pages C, E and F, these probabilities will keep circling among just these pages.
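The propagation in this example can be verified with a short script. This sketch performs only the pure link-following step described above (no jump probability), with the edge set inferred from the description:

```python
from fractions import Fraction

# Edges inferred from the example: A, B, F link to E; C, D link to F; E links to C.
links = {"A": ["E"], "B": ["E"], "C": ["F"], "D": ["F"], "E": ["C"], "F": ["E"]}
prob = {p: Fraction(1, 6) for p in links}  # every page starts at 1/6

for step in range(2):
    nxt = {p: Fraction(0) for p in links}
    for u, outs in links.items():
        for v in outs:
            nxt[v] += prob[u] / len(outs)  # pass probability along each link
    prob = nxt
    print(step + 1, {p: str(q) for p, q in prob.items() if q})

# After step 1: E=1/2, F=1/3, C=1/6, matching the text; the probability
# then keeps circling among pages C, E and F.
```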
A key assumption behind the early success of the PageRank technique is that a link on the web represents an endorsement of the target page by the presumably human author of the source page. Over time, however, this assumption has become less valid. Not only is there now a large quantity of automatically generated web content, but the success of the PageRank technique has led to the widespread deployment of measures to artificially boost the PageRank of a given web page.
One frequently exploited characteristic of PageRank is its uniform reset vector u, which has the effect of giving each of the n pages on the web equal endorsement power. In the six-page example just described, for instance, every page receives the same guaranteed minimum score from the reset vector, regardless of the quality of its content.
It can thus be seen that an agent or commercial operator could generate a very large number of web pages that reference a target page, thereby boosting the PageRank score of the target page. One countermeasure is to use a non-uniform reset vector b that is weighted towards pages that are perceived to be trustworthy. The key question is how to appropriately choose the non-uniform reset vector b.
This disclosure explores the idea of biasing the reset vector u toward pages that are linked to by blogs. This idea is based on the assumption (or hope) that blogs are still mostly human-authored, and that links from blogs generally represent sincere endorsements on the part of their authors. If this is true, then the resulting stationary distribution would tend to avoid pages that are ranked highly by the traditional PageRank technique only because of link spam or automatically generated content. Specifically, three reset vectors were evaluated: a vector b* that distributes the reset probability uniformly over the set B of all pages linked to by blogs; a vector b10 that distributes it uniformly over the subset B10 of pages linked to by the top 10% of blogs; and a vector b1 that distributes it uniformly over the subset B1 of pages linked to by the top 1% of blogs.
By choosing this range of reset vectors, a better understanding might be obtained of which is more helpful: considering the links from a small, high-quality set of blogs, or the (more numerous) links from a larger set of blogs.
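A minimal sketch of constructing such biased reset vectors follows, assuming the sets of blog-endorsed pages have already been identified; the page identifiers and the uniform-within-set weighting are illustrative assumptions:

```python
import numpy as np

def biased_reset_vector(n_pages: int, endorsed: set[int]) -> np.ndarray:
    """Reset vector that places all reset probability, uniformly,
    on the pages in the endorsed set (pages linked to by blogs)."""
    b = np.zeros(n_pages)
    for page in endorsed:
        b[page] = 1.0 / len(endorsed)
    return b

# Hypothetical page ids endorsed by all blogs, the top 10%, and the top 1%:
B_all, B_top10, B_top1 = {0, 2, 3, 5}, {2, 3}, {3}
b_star = biased_reset_vector(6, B_all)
b_10 = biased_reset_vector(6, B_top10)
b_1 = biased_reset_vector(6, B_top1)
# Each vector can then be passed to the modified PageRank computation in
# place of the uniform reset vector u.
```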
Evaluating the effectiveness of this approach requires two steps. First, a large web graph must be assembled on which reset vectors can be defined and a modified PageRank can be computed. Second, the resulting rank vectors must be evaluated for relevance.
As noted above, a large focused crawl was performed aimed at gathering blogs, and a large general web crawl was also performed that downloaded over 472 million pages (including redirects) and inferred the existence of more than 6 billion additional pages based on links contained in the downloaded set. A web graph was then constructed on the union of these two sets of pages, though note that each page in the second set necessarily has outdegree zero in the graph.
The set B then includes those downloaded pages in the web crawl that were also linked to by any page classified as a blog in the focused crawl. The sets B10 and B1 are the subsets of B that include only pages linked to by the top 10% and top 1% of top-level blogs, respectively. Of the 472 million downloaded pages, B, B10, and B1 cover 53,897,337 pages, 39,400,498 pages, and 21,006,695 pages, respectively. The reset vectors b*, b10, and b1 were then defined from these sets as described above, and PageRank vectors were computed by power iteration with these reset vectors.
In order to determine whether the modified PageRank technique represents an improvement over the traditional PageRank technique, a large set of human-judged web pages is referenced. Specifically, this data set comprises over 66 million result pages that were returned by publicly available search engines in response to a set of 28,043 randomly selected search queries, of which a subset of 485,656 pages were judged on a scale from 0 (worst) to 5 (best). Of the large set, 19,664,033 result pages also appear in the web crawl, and of the judged subset, 339,351 are in the crawl.
The metric used to evaluate the modified rankings is the normalized discounted cumulative gain (“NDCG”), which is tailored to web search in that it gives more weight to the quality of highly ranked pages than to lower-ranked ones. More specifically, for a ranking r and a parameter k, NDCGk is defined as follows:

NDCGk = Z · Σi=1..k (2^q(ri) − 1) / log(1 + i),
where q(ri) is the judged quality (on the scale of 0 to 5 referenced above) of the ith-ranked page according to the ranking r, and Z is a normalization constant that makes the maximum possible value 1. Note that the discounting factor is logarithmic, so the weight assigned to a position drops off quickly over the first few ranks and only very slowly thereafter. The NDCG scores of the modified rankings, restricted to the judged result pages, were then compared with the score produced by the traditional PageRank technique using the uniform reset vector.
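A sketch of the NDCG computation follows, assuming the standard 2^q − 1 gain and log(1+i) discount that match the definition above; the sample judgments are invented for the example:

```python
import math

def ndcg(judgments: list[int], k: int) -> float:
    """NDCG@k for a ranking whose i-th page has judged quality judgments[i]
    on the 0 (worst) to 5 (best) scale."""
    def dcg(quals):
        return sum((2 ** q - 1) / math.log(1 + i)
                   for i, q in enumerate(quals[:k], start=1))
    ideal = dcg(sorted(judgments, reverse=True))  # best possible ordering
    return dcg(judgments) / ideal if ideal > 0 else 0.0

# Made-up judged qualities for the top-ranked pages of some ranking r:
print(ndcg([5, 3, 4, 0, 2, 5, 1], k=5))
```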
As the experimental results demonstrated, the rankings produced by each of the modified reset vectors achieved higher NDCG scores than the ranking produced by the traditional uniform reset vector.
At first observation, the improvements achieved by the modified reset vectors may seem small. It is worth noting, however, that even small differences in NDCG may be more significant than they appear. To place these differences in perspective, note that the difference in NDCG between the traditional unbiased PageRank technique and randomly generated results is only 0.08 for k=10 and 0.10 for k=5000.
To further confirm that these improvements are not the result of random noise, a range of new rankings was created by taking linear combinations of the unbiased PageRank stationary distribution and the modified PageRank distributions. Formally, if π is the PageRank vector computed from an unbiased reset vector, and π′ is the PageRank vector from a modified reset vector, then new vectors (1−α)π+απ′ for 0≤α≤1 are created. If the improvements were insignificant, one would expect the NDCG scores of the mixed rankings to vary above and below the score obtained by the pure ranking. The results of this experiment showed no such random variation, further indicating that the improvements are not attributable to noise.
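The mixing construction can be sketched as follows; the two stationary distributions and the α grid are made-up placeholders:

```python
import numpy as np

# Placeholder stationary distributions: pi from the unbiased reset vector,
# pi_mod from a blog-biased reset vector (values are invented).
pi = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
pi_mod = np.array([0.10, 0.15, 0.20, 0.25, 0.30])

for alpha in np.linspace(0.0, 1.0, 5):
    mixed = (1 - alpha) * pi + alpha * pi_mod  # (1 - a) * pi + a * pi'
    ranking = np.argsort(-mixed)  # page indices ordered by mixed score
    print(f"alpha={alpha:.2f}", ranking)
# The NDCG of each mixed ranking can then be examined as a function of
# alpha; a consistent trend, rather than fluctuation around the pure
# ranking's score, supports a genuine improvement.
```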
Perhaps surprisingly, the reset vector that incorporated all pages linked to by blogs performed better than the vector that used only the top 10% of blogs, which in turn performed better than the vector that used only the top 1%. A likely explanation is that even links from low-ranked blogs are authentic and thus useful for ranking, although this may change as spam blogs become more prevalent.
Other characteristics of blogs may prove useful in refining the modified technique. For example, the number of subscribers to a particular blog could be evaluated and factored into the iterative process. If a blog has a large number of subscribers, suggesting that a higher endorsement level should be associated with that blog, then the scoring method could be modified to give additional weight to such a blog, as in the sketch below.
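One possible refinement is sketched below, under the assumption that per-blog subscriber counts are available; the log-damped weighting scheme is an illustrative choice, not part of this disclosure:

```python
import math
import numpy as np

def subscriber_weighted_reset(n_pages: int,
                              endorsements: dict[int, list[int]]) -> np.ndarray:
    """Reset vector where endorsements maps a page to the subscriber counts
    of the blogs linking to it; weights are log-damped and normalized."""
    b = np.zeros(n_pages)
    for page, subscriber_counts in endorsements.items():
        # log damping keeps one huge blog from dominating (assumed choice)
        b[page] = sum(math.log(1 + s) for s in subscriber_counts)
    total = b.sum()
    return b / total if total > 0 else np.full(n_pages, 1.0 / n_pages)

# Hypothetical data: page 2 is endorsed by blogs with 1,000 and 50
# subscribers, page 4 by one blog with 10, page 5 by two with 300 each.
endorsements = {2: [1000, 50], 4: [10], 5: [300, 300]}
print(subscriber_weighted_reset(6, endorsements))
```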
To summarize the modified technique for assigning scores to web pages: web pages are first classified as blog or non-blog; a reset vector biased toward pages that are linked to by blogs is then constructed; and the PageRank computation is performed using the biased reset vector to produce the final scores, which are used to order search results.
Based on observed results, blogspace is an attractive setting for the classical PageRank algorithm; indeed, blogs ranked highly by PageRank turn out to be frequently updated, more informational than personal, and free of spam. Thus, the links that exist in pages identified as blog pages may be used to improve PageRank by biasing the reset vector.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. It is intended that the scope of the invention be defined by the claims appended hereto.