User-sensitive pagerank

Information

  • Patent Application
  • 20080010281
  • Publication Number
    20080010281
  • Date Filed
    June 22, 2006
  • Date Published
    January 10, 2008
Abstract
Techniques are described for generating an authority value of a first one of a plurality of documents. A first component of the authority value is generated with reference to outbound links associated with the first document. The outbound links enable access to a first subset of the plurality of documents. A second component of the authority value is generated with reference to a second subset of the plurality of documents. Each of the second subset of documents represents a potential starting point for a user session. A third component of the authority value is generated representing a likelihood that a user session initiated by any of a population of users will end with the first document. The first, second, and third components of the authority value are combined to generate the authority value. At least one of the first, second, and third components of the authority value is computed with reference to user data relating to at least some of the outbound links and the second subset of documents.
Description

BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a flow diagram illustrating operation of a specific embodiment of the present invention.



FIG. 2 is a simplified diagram of a network environment in which specific embodiments of the present invention may be implemented.





DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Reference will now be made in detail to specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In addition, well known features may not have been described in detail to avoid unnecessarily obscuring the invention.


The present invention provides a variety of ways to compute PageRank with reference to various types of data corresponding to actual user behavior. FIG. 1 is a flow diagram which illustrates this general idea. User data 100 which reflect the behavior and/or demographics of an underlying user population are collected and indexed (102). At least some of these data track the navigational behavior of the user population with regard to documents, pages, sites, and domains visited, and links selected. As described below, the user population, the computing context, and the techniques for collecting these data may vary considerably without departing from the scope of the present invention.


PageRank computation is performed for a plurality of pages and/or documents using a PageRank formulation constructed according to the present invention (104). As will be described, such PageRank formulations include at least one component which is derived with reference to the user data. In addition, the PageRank computation may be performed for each page/document on the Web or at some higher level of aggregation (e.g., site, host, domain, etc.). The PageRank computations may then be employed in support of a wide variety of applications (106) such as, for example, in relevancy determinations for the ranking of search results in response to user queries. And because the set of pages, the connections between them, and user behavior may change over time, the user data collection and PageRank computations may be iterated (dashed line) to ensure that they reflect the most current conditions in the computing environment.


Various embodiments of the present invention may employ PageRank formulations which incorporate or make use of user data in a variety of ways which address one or more of the issues described above. For example, as noted above, the assumption of uniform endorsement along all outbound links associated with a page is unrealistic: internal links (e.g., disclaimer links) typically do not carry the same endorsement as external links. Rather, users “vote” by their behavior in terms of the links they actually select. Moreover, the popularity of the links selected is not static, but changes over time.


Therefore, according to various embodiments, empirical data corresponding to link selection behavior by users are employed to weight outbound links in a PageRank formulation such that this user behavior is taken into account. According to a specific embodiment, the number of users who browsed from page i to page j along a link connecting the two pages is employed to assign to the link a weight which reflects a likelihood that a user will move along the directed edge corresponding to the link. Additional details regarding exemplary techniques by which this weighting may be accomplished are provided in U.S. Pat. No. 6,792,419 for System And Method For Ranking Hyperlinked Documents Based On A Stochastic Backoff Processes, the entire disclosure of which is incorporated herein by reference for all purposes.


Because most pages have very little traffic associated with them, and the traffic they do have corresponds to a low confidence estimate of user intent, according to a specific embodiment of the invention, the terms in the Markov transition matrix of equation (1) may instead be derived as follows:


$$w_{ij} = \frac{1 + \alpha\, n_{i \to j}}{\deg(i) + \alpha \sum_{i \to j} n_{i \to j}} \tag{3}$$


where α ≥ 0 is a Laplace smoothing factor, and $n_{i \to j}$ is the number of users who followed the link i→j. It should be noted that coefficient α = 0 corresponds to a conventional formulation of this component. Notice also that higher values of $n_{i \to j}$ have a higher impact on $w_{ij}$, in agreement with the fact that higher values imply higher confidence.
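
The following is a minimal sketch of how equation (3) might be evaluated, not the applicants' implementation. The function name, the dictionary layout, and the sample counts are illustrative assumptions. The sketch also assumes that the sum in the denominator runs over all out-links of page i, so that the weights leaving each page sum to one.

```python
def smoothed_link_weights(out_links, click_counts, alpha=1.0):
    """Sketch of equation (3).

    out_links:    dict page -> list of target pages (hypothetical layout)
    click_counts: dict (i, j) -> number of users who followed link i -> j
    alpha:        Laplace smoothing factor, alpha >= 0
    """
    weights = {}
    for i, targets in out_links.items():
        if not targets:
            continue  # dangling page: nothing to weight
        counts = [click_counts.get((i, j), 0) for j in targets]
        # deg(i) + alpha * (total observed traversals out of i)
        denom = len(targets) + alpha * sum(counts)
        for j, n in zip(targets, counts):
            weights[(i, j)] = (1 + alpha * n) / denom
    return weights


# Illustrative data: page "a" has three out-links, one of them never clicked.
out_links = {"a": ["b", "c", "d"]}
click_counts = {("a", "b"): 8, ("a", "c"): 2}

w = smoothed_link_weights(out_links, click_counts, alpha=0.5)
w_conventional = smoothed_link_weights(out_links, click_counts, alpha=0.0)
```

With these sample counts, the unclicked link a→d keeps a small nonzero weight, and setting alpha to zero gives the uniform weight 1/deg(i) on every link.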


While equation (3) does incorporate some measure of the likelihood that specific links will be selected by users, more specific embodiments of the invention are contemplated which reflect further refinement of the underlying assumptions. First, users are not equal. Rather, they are part of a social network in which different weights can be assigned to different users based on a variety of factors. Second, because the popularity of pages and links changes over time, the incorporation of one or more recency factors into the PageRank formulation may be desirable. Third, the use of user data enables the creation of a targeted PageRank by aggregating user behavior over a particular user segment as defined by demographics, behavioral characteristics, user profile, etc.


According to a more specific embodiment of the invention, these refinements result in the following generalization of equation (3) in which u denotes a user and S stands for a particular user segment:


$$w_{ij} = \frac{1 + \alpha \sum_{\substack{u \in S \\ u \in i \to j}} f(u)}{\deg(i) + \alpha \sum_{\substack{u \in S \\ u \in i \to j}} f(u)} \tag{3A}$$


Here u ∈ i→j means that user u followed link i→j. According to one formulation, u reflects user meta-data which may include, but are not limited to, weight, recency, tenure, and time spent on a page, thus yielding:


$$w_{ij} = \frac{1 + \alpha \sum_{\substack{u \in S \\ u \in i \to j}} f(u_{\text{weight}}, u_{\text{recency}}, u_{\text{tenure}}, u_{\text{time spent on } j})}{\deg(i) + \alpha \sum_{\substack{u \in S \\ u \in i \to j}} f(u_{\text{weight}}, u_{\text{recency}}, u_{\text{tenure}})}. \tag{3B}$$


Yet another specific embodiment reflects a further generalization of this idea. That is, conditioning by a user segment may assume use of a step function that is equal to one for users within S and to zero for users outside S. However, it should be noted that this idea may be generalized to any probability distribution $\rho_u$ (in practice, different significance levels can be assigned to different user segments), thus yielding:


$$w_{ij} = \frac{1 + \alpha \sum_{u \in i \to j} \rho_u\, f(u)}{\deg(i) + \alpha \sum_{u \in i \to j} \rho_u\, f(u)}. \tag{3C}$$
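
As a hedged illustration of the user-weighted generalizations, the sketch below evaluates a weight of the form of equation (3C). Choosing a step function for the significance term recovers the segment-restricted form of equation (3A). The function name, the user-record fields ("weight", "segment"), and the sample data are hypothetical. The sketch assumes the denominator accumulates the same per-user scores over all out-links of page i, which keeps each page's outgoing weights normalized.

```python
def user_weighted_link_weights(out_links, traversals, f, rho=lambda u: 1.0, alpha=1.0):
    """Sketch of equation (3C); a 0/1 step function for rho gives equation (3A).

    out_links:  dict page -> list of target pages
    traversals: dict (i, j) -> list of user records that followed i -> j
    f:          function scoring a user record (e.g., from weight, recency, tenure)
    rho:        function giving the significance of a user record
    """
    weights = {}
    for i, targets in out_links.items():
        if not targets:
            continue  # dangling page
        # Per-link score: sum of rho(u) * f(u) over users who followed i -> j.
        score = {
            j: sum(rho(u) * f(u) for u in traversals.get((i, j), []))
            for j in targets
        }
        denom = len(targets) + alpha * sum(score.values())
        for j in targets:
            weights[(i, j)] = (1 + alpha * score[j]) / denom
    return weights


# Illustrative records; only users in segment "S" are counted.
out_links = {"a": ["b", "c"]}
traversals = {
    ("a", "b"): [{"weight": 3.0, "segment": "S"}, {"weight": 1.0, "segment": "T"}],
    ("a", "c"): [{"weight": 1.0, "segment": "S"}],
}
in_segment = lambda u: 1.0 if u["segment"] == "S" else 0.0

w = user_weighted_link_weights(
    out_links, traversals, f=lambda u: u["weight"], rho=in_segment, alpha=1.0
)
```

Passing a constant f and the default rho reduces the sketch to the plain traversal counts of equation (3).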







It should be noted that embodiments of the invention may work on any level of aggregation (i.e., for blocked PageRank formulations). For example, for a site or host level graph, a link between site I and site J exists if there are pages i and j connected by a hyperlink such that i ∈ I and j ∈ J. Weights $W_{IJ}$ can then be assigned to the link I→J using a formula similar to any of (3)-(3C), with $N_{IJ}$ being a count of users who proceeded from any page i in site I to any page j in site J.
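
One way the site-level counts might be obtained is sketched below, under stated assumptions: pages are identified by URL, a "site" is taken to be the URL's hostname, and traversals that stay within one site are dropped. None of these choices is prescribed by the description; the function names and URLs are illustrative.

```python
from collections import Counter
from urllib.parse import urlsplit


def site_of(url):
    """Map a page URL to its block; here the block is simply the hostname."""
    return urlsplit(url).hostname


def aggregate_to_sites(page_counts):
    """Sum page-level traversal counts n_{i->j} into site-level counts N_{IJ}."""
    site_counts = Counter()
    for (src, dst), n in page_counts.items():
        site_i, site_j = site_of(src), site_of(dst)
        if site_i != site_j:  # assumption: ignore intra-site navigation
            site_counts[(site_i, site_j)] += n
    return dict(site_counts)


page_counts = {
    ("http://a.example/1", "http://b.example/1"): 3,
    ("http://a.example/2", "http://b.example/9"): 4,
    ("http://a.example/1", "http://a.example/2"): 5,
}
site_counts = aggregate_to_sites(page_counts)
```

The resulting counts can then take the place of the page-level counts in any of the weight formulas, with deg(I) counting the out-links of the site-level graph.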


Because of “dangling” pages, i.e., pages having no out-links, and because of the requirement of a graph's strong connectivity (i.e., the Markov transition matrix P has to be irreducible), a degree of teleportation is added to the PageRank formulation of equation (1) as described above. A typical teleportation distribution $v = (v_j)$ used in a conventional PageRank formulation is selected either uniformly or uniformly among a subset of trusted pages. As noted above, both approaches have shortcomings. That is, users do not start from obscure pages with the same probability as from popular hubs (e.g., think of the effect of bookmarks), and uniform teleportation actually leads to link-based spam. On the other hand, what can be trusted is in dispute, and a restrictive definition of trust defeats the purpose of creating a strongly-connected graph.


Therefore, according to various embodiments of the invention, user data are utilized to meaningfully estimate a teleportation distribution for a PageRank formulation. Consider different user sessions. Each session has a first or a starting page. Let $m_j$ be the count of how many times a page j was a first page in a session. Then, according to a specific embodiment, a realistic teleportation distribution v′ can be defined as a blend of a more conventional distribution (e.g., v as defined above) with a user-data-based component as follows:


$$v'_j = \beta\, v_j + (1 - \beta)\, \frac{m_j}{\sum_j m_j}. \tag{4}$$


where 0 ≤ β ≤ 1 is a tuning parameter which adjusts the degree of blending of the two components. Again, it should be noted that β = 1 corresponds to a conventional formulation of this component. A higher β means a larger degree of exploration and a lesser degree of reliance on behavioral data. According to one exemplary embodiment, β = 0.2 is recommended as a reasonable tradeoff. It should be noted that equation (4) can be generalized in a manner similar to the generalization of equation (3) to equations (3A)-(3C) to incorporate user network utility, user tenure, recency, and time spent on a page. Even if relatively few pages on the Web actually have a non-zero count $m_j$, the idea leads to a good teleportation distribution with a small β accounting for a degree of exploration. The fact that only a small fraction of pages on the Web would have a significant teleportation component agrees with the well known fact that a small portion of pages actually carries the bulk of the PageRank distribution. Again, in deriving this teleportation distribution, many other characteristics beyond frequency counts can be taken into account, as was done for equations (3A)-(3C).


The above-described embodiments suggest simple yet powerful frameworks for addressing two of the faulty assumptions underlying conventional PageRank formulations, i.e., uniform link weighting and uniform teleportation. According to further embodiments of the invention, another shortcoming of conventional PageRank formulations, i.e., the uniform teleportation coefficient c, is addressed. Previously, it has been assumed that given a particular page, a random surfer “becomes bored” and jumps or “teleports” to a new session (i.e., at a new page) with uniform probability (1−c). In reality, assuming a uniform dropout rate is a very poor approximation. Therefore, according to various embodiments of the invention, user data are utilized to estimate individual teleportation coefficients for specific pages or blocks.


Let $g_i$ be the fraction of sessions that end on page i among all sessions containing i. Then, according to a specific embodiment, a page-specific estimate of a dropout rate may be given by:





$$(1 - c_i) = (1 - c)\,\gamma + (1 - \gamma)\, g_i \tag{5}$$


where c is a conventional teleportation coefficient, and 0 ≤ γ ≤ 1 is a tuning parameter which enables varying degrees of blending of conventional teleportation coefficients with page-specific data. Here γ = 1 corresponds to a conventional formulation, with γ = 0.25 being a reasonable default.
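
Equations (4) and (5) can be sketched together as follows. This is an illustration only: the function names, the page names, and the session statistics are assumptions, and the value c = 0.85 is used merely as a commonly chosen conventional coefficient rather than a value taken from this description. The fallback to the conventional distribution when no session starts are recorded is likewise an assumption.

```python
def blended_teleportation(v, start_counts, beta=0.2):
    """Sketch of equation (4): v'_j = beta * v_j + (1 - beta) * m_j / sum(m).

    v:            dict page -> conventional teleportation probability
    start_counts: dict page -> number of sessions that started on that page
    """
    total = sum(start_counts.values())
    if total == 0:
        return dict(v)  # assumption: no session data, keep the conventional v
    return {
        j: beta * v[j] + (1 - beta) * start_counts.get(j, 0) / total
        for j in v
    }


def page_specific_coefficients(c, end_fraction, gamma=0.25):
    """Sketch of equation (5), solved for c_i.

    end_fraction: dict page -> g_i, the fraction of sessions containing
                  page i that end on page i
    """
    return {
        i: 1 - ((1 - c) * gamma + (1 - gamma) * g)
        for i, g in end_fraction.items()
    }


pages = ["home", "news", "faq", "terms"]
v = {p: 1 / len(pages) for p in pages}  # uniform conventional distribution

v_blend = blended_teleportation(v, {"home": 9, "news": 1}, beta=0.2)
c_i = page_specific_coefficients(0.85, {"home": 0.1, "terms": 0.9}, gamma=0.25)
```

In this example most of the teleportation mass moves to the page where sessions usually start, while a page on which most sessions end receives a much lower coefficient than the conventional value.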


As discussed above, equations (3), (4), and (5) compute quantities related to PageRank formulations with reference to data corresponding to actual user behavior. In addition, further generalizations make it possible to account for other elements of user behavior such as, for example, user network utility, user recency, user tenure, time spent on a page, etc., e.g., equations (3A)-(3C). However, because the confidence levels for user behavior estimates relating to infrequently visited pages are low, some regularization may be desirable for specific embodiments of the invention.


It can be argued that the fraction of pages for which user data are available is small in comparison with the realm of all Web pages. Were it not so, the count of visits per page would serve as a good approximation of authority. Therefore, as described above, embodiments of the invention utilize authority propagation from conventional PageRank formulations while deriving out-link weights, teleportation vectors, and teleportation coefficients based on user behavioral data, thus blending these two types of data to varying degrees. Thus, embodiments of the invention provide more accurate PageRank authority of all pages, including pages that have little or no visitation.


Put another way, embodiments of the present invention consolidate conventional formulations applicable to any page with new formulations applicable to the relatively few frequently visited, and therefore high-authority, pages. According to some of the exemplary formulations described herein, this consolidation may be achieved to varying degrees using a kind of Laplace smoothing represented in equations (3)-(5) by parameters α, β, and γ. For α = 0 and β = γ = 1, the formulations reduce to the conventional formulation represented by equation (1). On the other hand, if any one of these three parameters departs from these values, some level of blending occurs and is therefore within the scope of the invention. Thus, it should be noted that embodiments of the invention are contemplated in which these tuning parameters range in value such that only one, two, or all three of the corresponding components are in play.


Further refinements and applications of the present invention will now be described.


User Segment Personalized PageRank


Many attempts have been made to define personalized PageRank formulations. For example, by selecting a narrow set of topic specific pages and restricting teleportation to these pages, a topical PageRank formulation can be constructed. According to specific embodiments of the present invention, PageRank formulations (or individual components thereof) derived in accordance with the present invention may be flexibly and straightforwardly applied to or used with any type of personalized PageRank formulation.


For example, user segmentation is commonly used in targeted advertising. A user segment can be defined in terms of a user demographic profile (e.g., age, gender, income, etc.), user location, user behavior, etc. Any or all of equations (3)-(5) above can then be specified to reflect any such user segment in that they are constructed with reference to user data corresponding to an underlying population which, in turn, can be restricted to the relevant user segment. Moreover, as discussed above, such formulations can take into account any probabilistic distribution of user relevancy such as, for example, assigning weights to different users on the basis of an age range distribution.


Blocked PageRank


As discussed above, PageRank formulations are often applied to aggregations at the host, site, or domain levels, often referred to as blocked PageRank. Blocked PageRank is useful in acceleration of PageRank computing and in PageRank personalization. To construct a blocked PageRank formulation, parameters for a factorized directed graph are defined. For example, equal weights may be assigned to every link from one block to another whenever the two blocks contain nodes connected by a directed edge. However, such a formulation would not distinguish between a pair of blocks connected by a single spurious link and a pair of blocks connected by multiple directed edges. A variety of schemes have been developed to derive weights for block super-edges, but performance in practice has yielded mixed results.


However, because user behavior naturally aggregates at the various different “block” levels (i.e., site, host, domain, etc.), the various PageRank formulations of the present invention naturally scale up to the various block levels.


Overall PageRank Iterations


PageRank computing is related to the so-called simple power iteration method. This method depends on parameters such as the edge probability distribution and teleportation described above. Equations (3)-(5) above and the generalization exemplified by equation (3A) thus lead to the following:


$$p_j^{(n+1)} = \sum_{i \to j} c_i\, w_{ij}\, p_i^{(n)} + (1 - c_i)\, v_j \tag{6}$$


where transition weights $w_{ij}$ are defined by equation (3) or its analogs (e.g., equations (3A)-(3C)), teleportation distribution $v_j$ is defined by equation (4) or its analogs, and teleportation coefficients $c_i$ are defined by equation (5) or its analogs. And any derived iterative schemes that accelerate PageRank convergence and/or construct or compute blocked PageRank which employ any of the PageRank formulations or components thereof described herein are within the scope of the present invention.
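
A minimal sketch of the iteration of equation (6) is given below. It rests on assumptions that the description does not spell out: the teleportation term is read as the total mass (1 − c_i) p_i leaving every page i, redistributed according to v; dangling pages teleport all of their mass; and a fixed number of iterations is used instead of a convergence test. The function name and the three-page graph are illustrative.

```python
def user_sensitive_pagerank(pages, w, c, v, iterations=100):
    """Power iteration in the spirit of equation (6).

    pages: list of page identifiers
    w:     dict (i, j) -> transition weight w_ij (each page's weights sum to 1)
    c:     dict page -> page-specific coefficient c_i
    v:     dict page -> teleportation probability v_j (sums to 1)
    """
    out = {i: [] for i in pages}
    for (i, j), wij in w.items():
        out[i].append((j, wij))

    p = {j: 1.0 / len(pages) for j in pages}  # uniform starting vector
    for _ in range(iterations):
        nxt = {j: 0.0 for j in pages}
        teleport = 0.0
        for i in pages:
            # Assumption: a dangling page teleports with probability one.
            follow = c[i] if out[i] else 0.0
            for j, wij in out[i]:
                nxt[j] += follow * wij * p[i]
            teleport += (1.0 - follow) * p[i]
        for j in pages:
            nxt[j] += teleport * v[j]
        p = nxt
    return p


pages = ["a", "b", "c"]
w = {("a", "b"): 1.0, ("b", "a"): 0.5, ("b", "c"): 0.5, ("c", "a"): 1.0}
c = {page: 0.85 for page in pages}      # illustrative coefficients
v = {page: 1.0 / 3.0 for page in pages}  # illustrative teleportation vector

rank = user_sensitive_pagerank(pages, w, c, v)
```

Under this reading the total probability mass is preserved at every step, so the result remains a distribution over pages.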


Time Dynamics


In principle, PageRank should be periodically recomputed because the Web graph grows and its topology changes with time. In addition to this purely topological change, core pages with the same in-links and out-links still come in and out of fashion or significance over time. This is particularly important given that there is no “garbage collection” on the Web. Yet another advantage of the PageRank formulations of the present invention is that it is relatively straightforward to incorporate time dynamics. For example, a discount procedure such as exponential averaging could readily be incorporated into user behavior counts to emphasize recent events and discount old ones. Not only does such a modification capture temporally dependent changes in page popularity, it also operates as a de facto Web garbage collection utility.
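
The discounting idea can be illustrated with a simple exponential moving average over per-period counts. The function name, the smoothing value, and the sample series are assumptions chosen for illustration; the description does not fix a particular discount procedure.

```python
def exponentially_averaged_count(period_counts, smoothing=0.5):
    """Exponential averaging of per-period counts, oldest period first.

    Each new period contributes `smoothing` of its count, and the running
    average keeps (1 - smoothing) of its previous value, so older
    observations fade geometrically.
    """
    average = 0.0
    for n in period_counts:
        average = smoothing * n + (1 - smoothing) * average
    return average


# Two links with the same raw total of 100 traversals over three periods.
once_popular = exponentially_averaged_count([100, 0, 0])
now_popular = exponentially_averaged_count([0, 0, 100])
```

A link whose traffic is old ends up with a much smaller effective count than one with the same volume of recent traffic, which is the behavior the passage describes. The averaged value would be used in place of the raw traversal count in the weight formulas.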


Other Applications


As will be understood, the various PageRank formulations of the present invention may be used in conjunction with other information to evaluate page relevance in ranking search results according to any of a wide variety of techniques. However, it should be noted that the PageRank formulations of the present invention may be used in a wide variety of other applications. An example of one such application is controlling the manner in which a web crawling application crawls the Web. That is, the PageRank formulations of the present invention may be used to support decision making by a web crawler in determining whether to crawl from a given page and, if so, which of the links associated with that page to follow.


Moreover, the basic principles described herein can be generalized beyond PageRank formulations. Consider anchor text, which is known to be one of the most useful features in ranking retrieved Web search results. It is usually assembled by aggregating the text strings of the HTML <a href> tags associated with incoming links. However, since incoming links differ in popularity, this text can be supplied with weights derived according to the present invention. According to the invention, knowledge of user behavior may be incorporated into such a technique as follows. Given a target page j, anchor texts corresponding to incoming links i→j are weighted with user behavior scores $w_{ij}$ computed as described above. As will be understood, various formulas may be used in relevancy ranking to aggregate hyperlink anchor text. Any of those formulas may be modified in accordance with the present invention to reflect link weights corresponding to user behavior in a manner similar to equations (3)-(3C).
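
As a rough sketch of weighted anchor-text aggregation, the example below adds the link weight of each incoming link to every term of its anchor text. The whitespace tokenization, the additive aggregation, the function name, and the sample data are all assumptions; the description leaves the aggregation formula open.

```python
from collections import defaultdict


def weighted_anchor_terms(inbound, link_weights):
    """Aggregate anchor-text terms for one target page j.

    inbound:      list of (source_page, anchor_text) pairs for links into j
    link_weights: dict source_page -> user-behavior weight w_ij of that link
    """
    scores = defaultdict(float)
    for source, text in inbound:
        for term in text.lower().split():
            scores[term] += link_weights.get(source, 0.0)
    return dict(scores)


inbound = [
    ("p1", "cheap flights"),
    ("p2", "Flights to Rome"),
    ("p3", "click here"),
]
link_weights = {"p1": 0.5, "p2": 0.25, "p3": 0.125}

term_scores = weighted_anchor_terms(inbound, link_weights)
```

Terms that appear on frequently followed links accumulate higher scores than terms that appear only on rarely followed ones.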


Embodiments of the present invention may be employed to compute PageRank or similar formulations in any of a wide variety of computing contexts. For example, as illustrated in FIG. 2, implementations are contemplated in which the relevant population of users interact with a diverse network environment via any type of computer (e.g., desktop, laptop, tablet, etc.) 202, media computing platforms 203 (e.g., cable and satellite set top boxes and digital video recorders), handheld computing devices (e.g., PDAs) 204, cell phones 206, or any other type of computing or communication platform.


And according to various embodiments, user data processed in accordance with the invention may be collected using a wide variety of techniques. For example, collection of data representing a user's interaction with specific Web pages may be accomplished using any of a variety of well known mechanisms for recording a user's online behavior. However, it should be understood that such methods of data collection are merely exemplary and that user data may be collected in many other ways. For example, user data may be collected when a user registers with, for example, a particular web site or service.


Once collected, the user data are processed and stored in some centralized manner. This is represented in FIG. 2 by server 208 and data store 210 which, as will be understood, may correspond to multiple distributed devices and data stores. The invention may also be practiced in a wide variety of network environments (represented by network 212) including, for example, TCP/IP-based networks, telecommunications networks, wireless networks, etc. In addition, the computer program instructions with which embodiments of the invention are implemented may be stored in any type of computer-readable media, and may be executed according to a variety of computing models including a client/server model, a peer-to-peer model, on a stand-alone computing device, or according to a distributed computing model in which various of the functionalities described herein may be effected or employed at different locations.


While the invention has been particularly shown and described with reference to specific embodiments thereof, it will be understood by those skilled in the art that changes in the form and details of the disclosed embodiments may be made without departing from the spirit or scope of the invention. In addition, although various advantages, aspects, and objects of the present invention have been discussed herein with reference to various embodiments, it will be understood that the scope of the invention should not be limited by reference to such advantages, aspects, and objects. Rather, the scope of the invention should be determined with reference to the appended claims.

Claims
  • 1. A computer-implemented method for generating an authority value of a first one of a plurality of documents, comprising: generating a first component of the authority value with reference to outbound links associated with the first document, the outbound links enabling access to a first subset of the plurality of documents; generating a second component of the authority value with reference to a second subset of the plurality of documents, each of the second subset of documents representing a potential starting point for a user session; generating a third component of the authority value representing a likelihood that a user session initiated by any of a population of users will end with the first document; and combining the first, second, and third components of the authority value to generate the authority value; wherein at least one of the first, second, and third components of the authority value is computed with reference to user data relating to at least some of the outbound links and the second subset of documents.
  • 2. The method of claim 1 wherein generating the first component of the authority value comprises assigning a weight to each of the outbound links, each of the weights being derived with reference to a portion of the user data representing a frequency with which the corresponding outbound link was selected by a population of users.
  • 3. The method of claim 2 wherein the plurality of documents may be represented by a graph, and each of the weights represents a likelihood that a user will traverse a directed edge of the graph associated with the corresponding outbound link.
  • 4. The method of claim 2 wherein the population of users corresponds to a segment of a superset of users, the segment being selected from the superset of users with reference to at least one of demographic data and behavior data.
  • 5. The method of claim 4 wherein the demographic and behavior data relate to any of user importance, user recency, user tenure, and user time spent.
  • 6. The method of claim 2 wherein each of the weights includes a constant nonzero component derived with reference to a number of the outbound links.
  • 7. The method of claim 2 wherein each of the weights is further derived with reference to a probabilistic distribution associated with the population of users.
  • 8. The method of claim 1 wherein generating the second component of the authority value comprises generating a teleportation distribution which includes a term for each of the second subset of documents, each of the terms being derived with reference to a portion of the user data representing relevance of the corresponding document among a population of users.
  • 9. The method of claim 8 wherein the relevance of the corresponding document is determined with reference to a frequency with which the corresponding document began a user session initiated by any of a population of users.
  • 10. The method of claim 8 wherein the population of users corresponds to a segment of a superset of users, the segment being selected from the superset of users with reference to at least one of demographic data and behavior data.
  • 11. The method of claim 10 wherein the demographic and behavior data relate to any of user importance, user recency, user tenure, and user time spent.
  • 12. The method of claim 8 wherein each of the terms includes a constant nonzero component derived with reference to a number of the second subset of documents.
  • 13. The method of claim 8 wherein each of the terms is further derived with reference to a probabilistic distribution associated with the population of users.
  • 14. The method of claim 1 wherein the population of users corresponds to a segment of a superset of users, the segment being selected from the superset of users with reference to at least one of demographic data and behavior data.
  • 15. The method of claim 14 wherein the demographic and behavior data relate to any of user importance, user recency, user tenure, and user time spent.
  • 16. The method of claim 1 wherein the third component of the authority value comprises a teleportation coefficient which includes a constant nonzero component.
  • 17. The method of claim 1 wherein the third component of the authority value comprises a teleportation coefficient derived with reference to a probabilistic distribution associated with the population of users.
  • 18. The method of claim 1 wherein each of the first, second, and third components of the authority value is generated with reference to the user data.
  • 19. The method of claim 1 wherein both of the first and second components of the authority value are generated with reference to the user data.
  • 20. The method of claim 1 wherein the first document comprises any of a page, a file, a site, a host, a domain.
  • 21. The method of claim 1 further comprising ranking the first document among a plurality of search results with reference to the authority value.
  • 22. The method of claim 1 further comprising facilitating decision making by a web crawling application with reference to the authority value.
  • 23. The method of claim 1 further comprising periodically regenerating the authority value to reflect changes in the user data.
  • 24. A computer-implemented method for generating an authority value of a first one of a plurality of documents, comprising: identifying text associated with each of a plurality of inbound links enabling access to the first document; assigning a weight to the text associated with each of the inbound links, each of the weights being derived with reference to user data representing a frequency with which the corresponding inbound link was selected by a population of users; and generating the authority value with reference to the weights.
  • 25. The method of claim 24 wherein the plurality of documents may be represented by a graph, and each of the weights represents a likelihood that a user will traverse a directed edge of the graph associated with the corresponding inbound link.
  • 26. The method of claim 24 wherein the population of users corresponds to a segment of a superset of users, the segment being selected from the superset of users with reference to at least one of demographic data and behavior data.
  • 27. The method of claim 26 wherein the demographic and behavior data relate to any of user importance, user recency, user tenure, and user time spent.
  • 28. The method of claim 24 wherein each of the weights is further derived with reference to a probabilistic distribution associated with the population of users.
  • 29. The method of claim 24 wherein the first document comprises any of a page, a file, a site, a host, a domain.
  • 30. The method of claim 24 further comprising ranking the first document among a plurality of search results with reference to the authority value.
  • 31. The method of claim 24 further comprising periodically regenerating the authority value to reflect changes in the user data.