The invention relates to a device, or a method running on the device and executed by a processor of a computer system, for obtaining a relative importance ranking of a subset of nodes of a larger, multiply linked set of nodes. A link defines a reference from one node to another node, so that each node provides information about the nodes it is linked to; the ranking is based on the structure of the links between the nodes, and on link weights that determine the importance of said links.
The situation is that there are large computer networks (e.g. the world wide web) consisting of nodes (e.g. computer systems providing web pages of the www) and directed connections between those nodes (e.g. the links from one web page to another). The connections can be of different strength.
Importance ranking of entries of a large, multiply linked database is an extremely important and common problem. For example, the small subset could be the result of a database query, or more specifically the set of all the web pages containing the word ‘motor’. The importance ranking will usually be based on the network structure. Most common ranking mechanisms make direct use of the link structure of the network. Arguably the most famous instance is the original algorithm of Google, which ranks the web sites of the world wide web. The mechanism of that algorithm can be described as follows:
A ‘random walk’ is set up on the linked network, in the sense that when a ‘walker’ is currently at a node x of the network, it jumps to a random, different node in the next step. The precise node to which it jumps is usually selected using the link structure of the network. In the most basic implementation of Google's algorithm, the walker first tosses an unfair coin, which comes up heads with (usually) probability 0.15. (A fair coin shows either side with the same probability, i.e. ½ = 0.50; an unfair or biased coin shows one side with probability greater than 0.50 and the other side with correspondingly less.) If the coin comes up heads, the walker chooses a node from the whole network at random (each with equal probability) and goes there. If the coin comes up tails, the walker chooses at random, with equal probability, one of the nodes which is connected to its current location by an http link. For each node, the number of times that the walker jumps to this node is divided by the total number of steps taken. General mathematical theory assures that these quantities approach a limit as the number of steps becomes very large. This limit is a number between 0 and 1, and serves as the score of the given node. The importance ranking is then such that, among the hits of a given search query, nodes with higher scores are displayed first.
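The random-surfer mechanism described above can be sketched as follows. This is a minimal illustration on a toy network; the node names, the number of steps and the seed are illustrative choices, not part of the original description.

```python
import random

# A toy linked network: node -> list of nodes it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],  # 'd' receives no links from other nodes
}

def random_surfer_scores(links, steps=200_000, teleport=0.15, seed=0):
    """Simulate the random surfer and return visit frequencies per node."""
    rng = random.Random(seed)
    nodes = list(links)
    visits = {n: 0 for n in nodes}
    current = nodes[0]
    for _ in range(steps):
        if rng.random() < teleport or not links[current]:
            current = rng.choice(nodes)            # coin came up heads: jump anywhere
        else:
            current = rng.choice(links[current])   # tails: follow an outgoing link
        visits[current] += 1
    return {n: v / steps for n, v in visits.items()}

scores = random_surfer_scores(links)
# The scores sum to one and lie between 0 and 1, as stated above.
assert abs(sum(scores.values()) - 1.0) < 1e-9
```

Node ‘c’, which receives links from three other nodes, ends up with a much higher score than ‘d’, which receives none; the ranking simply sorts nodes by these scores.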
An equivalent formulation of this mechanism is that the links between the nodes carry a certain weight. In the above example, assume that the walker is at a node x which has links to a total of 7 other nodes in the network, and that there is a total of N nodes in the network. Then each of the 7 links leading from x to another node obtains a weight of 0.85/7, and the system or a user introduces additional links from x to every node of the network with a weight of 0.15/N each. Note that all the link weights leaving x sum up to one. Thus, an algorithm that picks a weighted link from the totality of links leaving x and follows it will lead to the same random walk as in the Google algorithm.
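The construction of these equivalent link weights can be sketched as follows. The graph and the damping value 0.85 mirror the example above; the helper name is a hypothetical one chosen for illustration, and the sketch assumes every node has at least one outgoing link (no dangling nodes).

```python
def google_link_weights(links, damping=0.85):
    """Build the weighted-link formulation: each real outgoing link of x gets
    weight damping/out_degree(x), and every node additionally receives a
    'teleport' link of weight (1 - damping)/N from x."""
    nodes = list(links)
    N = len(nodes)
    weights = {}
    for x in nodes:
        out = links[x]  # assumed non-empty (no dangling nodes)
        w = {y: (1 - damping) / N for y in nodes}  # teleport links, 0.15/N each
        for y in out:
            w[y] += damping / len(out)             # real links, 0.85/out_degree
        weights[x] = w
    return weights

links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
weights = google_link_weights(links)
# All link weights leaving any node sum up to one, as noted above.
assert all(abs(sum(w.values()) - 1.0) < 1e-12 for w in weights.values())
```

Following a link chosen with probability proportional to these weights reproduces exactly the random walk of the Google algorithm.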
The formulation using link weights offers more flexibility than the original Google algorithm. In particular, since the link weights directly influence the resulting ranking, one can think of deliberately strengthening certain link weights at the expense of others, simulating a random walker that prefers certain jumps to others. For example, instead of the links with strength 0.15/N to the whole network, the invention may pick a subset of the network, e.g. containing M nodes, and only introduce links from each x to that subset with a weight of 0.15/M each. This approach becomes more powerful when the subset can depend on the search result and/or user interaction, allowing the user to indirectly control the criteria by which the ranking is obtained. An obstacle to this idea is that in very large networks, it may take a very long time to compute a ranking using (the equivalent of) Google's algorithm, making it impossible to obtain user-specific rankings in real time. A central point of the current invention is a method and system that makes this real-time computation possible.
There are several extensions and refinements of the Google algorithm. One has to do with the removal of ‘dangling nodes’, which are nodes that have no outgoing links (see “Google's PageRank and Beyond: The Science of Search Engine Rankings” by Langville and Meyer, Princeton University Press, 2006). Others have to do with changing the mechanism by which the random walker picks its next step. For example, it is well known that web sites not meeting certain criteria (like e.g. offering secure communications) are ‘ranked down’, which may be done by decreasing the probability of the random walker visiting these web sites (for the policy of ranking down non-https sites, see e.g. https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html).
Existing ranking algorithms usually fall into two categories: (1) methods that pre-compute a global importance ranking for the whole network, such as the original Google algorithm, and (2) methods that re-rank, in real time, only the subset of nodes returned by a search query.
Approaches of the second category only work based on the results of the given search query, without taking any other part of the network into account. However, this may lead to very different, and potentially worse, importance rankings. For example, assume that the user searches for the term ‘gift’ (which means ‘poison’ in German) and obtains a result set containing English and German web sites. If the user is only given the option to rank up all search results that are in German, but based on the ‘standard’ ranking obtained from a random walker on the whole web, the results may be very different from a ranking that would be obtained by strengthening all the link weights between German-language web sites while at the same time weakening links to English-language web sites; the latter would correspond to a walker that is only allowed to (or will strongly prefer to) visit German-language sites in the first place.
The disadvantage of approach (1) above is that it is not very flexible and takes a long time to compute. In particular, it is not possible to include user defined ranking criteria in real time. Instead, the system needs to anticipate some common requests (such as preferring web pages in German) and to pre-compute a ranking for them. This means that an interactive re-ranking of search results is impossible or at least rather limited. Secondly, even a small change in the network structure usually needs a full re-computation of the importance ordering. So, problem b) below is not solved by the method.
The approaches of (2) (disclosed in US2015/0127637 A, US2008/114751 A, US2009/006356 A, U.S. Pat. No. 6,012,053 B, and U.S. Pat. No. 8,150,843 B) try to solve problem a) and to provide real-time interactive methods of re-ranking. Their limitation is that they only consider the subset that was e.g. returned by the search query. However, this subset will usually be embedded into a larger neighborhood that may influence its ranking strongly. In the example presented above, web pages containing the word ‘car’, ‘airplane’ or ‘lawn mower’ might be strongly linked to the web pages containing the word ‘motor’ and would influence their ranking. Given the many possible search terms, it would however be impractical to amend the methods of (2) by considering ‘enlarged’ sub-networks, since the choice of these networks would be difficult to make. So, problem a) cannot be solved in a satisfactory way by the existing approaches, and they do not offer any way to solve problem b).
In real world situations, it is often desirable to let the user select the link weights between the nodes, and then determine the importance ranking that results from the new link structure. Here are three examples:
1. One may give the user the chance to decide whether they want to rank down non-secure web sites, and by how much. In this case, the weight of any link to a non-secure web site would be reduced by a user specified amount.
2. One may give the user the opportunity to strengthen link weights between web sites written in a certain language only, and at the same time perform the ‘random’ jump (corresponding to the 0.15 side of the unfair coin) only to web sites in that language.
3. The database may constantly change by the addition of new nodes and/or links, and it is intended to keep the ranking up to date in real time.
In all three examples, what has to be done is to order (or re-order) a small subset of the database according to an importance ranking that cannot be pre-computed, and thus has to be determined in real time. The reason for not being pre-computable is that the importance ranking will depend on the link strengths that are given by the user, or by the change in the network structure. Also, since it is not reasonable to assume that the random walker will only walk on the small subset that is returned by the search query, the new importance ranking will depend on strengths of links between sites which are not in the small subset that needs to be ranked.
In summary, it would be desirable to have a method that achieves the following: when given a subset of nodes that should be compared, and a set of weighted links between all nodes of the network, the method returns (a good approximation of) the importance ranking that the random walker algorithm would give with the set of weighted links. This should happen fast enough to be suitable for real time applications.
The invention presents a system and method to provide an importance ranking for a small subset of nodes of a large linked network, where each link can have a strength attributed to it. The calculation has to be performed with high speed. The system receives as input a subset of relevant nodes/elements, and a full set of link strengths. It provides an importance ranking that is similar to the one that would be obtained by running the ‘random walker’ algorithm on the whole linked network with the given link weights. It will only compute the importance ranking for the relevant small subset of nodes, and therefore can be much faster than existing approaches. At the same time, it uses more information about the network structure than is contained in the small subset of nodes and the links connected directly to them. This way, it can possibly give more useful rankings than methods based on the information in the small subsets alone.
This invention allows
A first idea of the invention is that instead of computing the importance weights of all the nodes in the network, the invention computes importance ratios between different nodes. With the help of the importance ratios for a small subset of nodes, both tasks a) and b) can be achieved.
The basic working mechanism of the invention is to use reachability as a measure for comparing the importance of two nodes. Let p(a->b) be the probability that a random walker starting from a node a reaches node b before returning to node a. Broadly speaking, the idea of reachability is that node a is more important than node b if it is easier to reach a when starting from b than the other way round. In other words, a is more important than b if p(b->a) is bigger than p(a->b). The difference of importance can be quantified, as will be explained below.
Assume that one wants to compare the importance of two nodes a and b. Their ‘true’ importance weight is given in the sense of the Google algorithm, i.e. via the network structure, by the stationary distribution of the Markov chain induced by the network. Assume that w(a) is the true importance weight of a and w(b) is the true importance weight of b. In other words, one could calculate w(a) and w(b) by running the Google algorithm on the full network. It is a result of the paper [1] (V. Betz and S. Le Roux, Multi-scale metastable dynamics and the asymptotic stationary distribution of perturbed Markov chains, Stochastic Processes and their Applications 126 (11), November 2016) that the ratio w(a)/w(b) is exactly equal to the ratio r(a,b) = p(b->a)/p(a->b). Therefore, r(a,b) will be called the importance ratio of the nodes a and b.
A second idea of the invention is that while it is difficult to approximately compute w(a) without at the same time computing the weights for all other nodes of the system, it is easy to approximately compute p(b->a) and p(a->b) based only on local information. This provides a faster way to calculate an (approximate) importance ordering for selected database entries without at the same time calculating the weights of the whole remaining network.
The invention provides an explicit recipe for computing the approximate importance ratios. The claimed approximate method of computing importance ratios works by starting a ‘random surfer’ at a and letting it perform a random walk along the network, guided by the links and link weights between the nodes, in the same way as the Google algorithm does. Explicitly, when the surfer is at a node c, it will go to another node d with probability proportional to the link weight of the link between c and d (which may be zero). For this algorithm, it is not necessary that the link weights sum up to one: assume that there are links to the nodes d(1), . . . , d(n) of the network, with corresponding link weights s(1), . . . , s(n). Then the surfer will choose the node d(i) with probability p(i) = s(i)/(s(1)+ . . . +s(n)). The difference to the Google algorithm is that in Google's algorithm the surfer has to explore the whole network for a very long time, and the importance weight of the node a is then the proportion of time the surfer spent on a. In contrast, the claimed algorithm stops as soon as the surfer starting from a either hits the node b or returns to a. If it hits b, it is defined that the surfer made a successful journey; if it returns to a, it is defined that the journey was a failure. The method repeats this procedure many times (or runs it in parallel, which also improves the speed and the computing efficiency) and records the success ratio J(a,b), i.e. the number of successful journeys divided by the number of all journeys. The method then starts a surfer from b and records the ratio J(b,a) of successful journeys from b to a. The quotient R(a,b) = J(b,a)/J(a,b) of journey success ratios then approaches the importance ratio r(a,b) = p(b->a)/p(a->b) by the Ergodic Theorem. Since by the results of [1], r(a,b) = w(a)/w(b), computing J(b,a)/J(a,b) provides an approximation of the importance ratio without exploring the whole network.
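The journey-based recipe above can be sketched as follows. The toy network, the trial count and the seed are illustrative assumptions; the cancellation after a maximum number of steps anticipates the treatment of over-long journeys described further below.

```python
import random

# Toy link weights: node -> {neighbor: weight}; names and values are illustrative.
weights = {
    "a": {"b": 0.8, "c": 0.2},
    "b": {"a": 0.8, "c": 0.2},
    "c": {"a": 0.5, "b": 0.5},
}

def run_journey(weights, start, target, rng, max_steps=10_000):
    """One journey: walk from `start` until hitting `target` (success),
    returning to `start` (failure), or exceeding max_steps (cancelled)."""
    current = start
    for _ in range(max_steps):
        nbrs, ws = zip(*weights[current].items())
        current = rng.choices(nbrs, weights=ws)[0]  # pick a weighted link
        if current == target:
            return "success"
        if current == start:
            return "failure"
    return "cancelled"

def journey_success_ratio(weights, a, b, trials=2000, seed=0):
    """Estimate J(a,b): successful journeys divided by all completed journeys;
    cancelled journeys are not counted towards the total."""
    rng = random.Random(seed)
    outcomes = [run_journey(weights, a, b, rng) for _ in range(trials)]
    completed = [o for o in outcomes if o != "cancelled"]
    return sum(o == "success" for o in completed) / len(completed)

def importance_ratio(weights, a, b):
    """R(a,b) = J(b,a)/J(a,b), approximating w(a)/w(b)."""
    return journey_success_ratio(weights, b, a) / journey_success_ratio(weights, a, b)

# For this toy network the exact ratio is p(c->a)/p(a->c) = 0.9/0.36 = 2.5.
R_ac = importance_ratio(weights, "a", "c")
```

Note that the surfer never has to visit more than the immediate neighbourhood of a and b here, which is exactly why the method avoids exploring the whole network.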
The approximation can be quite good even after a relatively short computing time: in most cases, the journey will either be a success or a failure long before the surfer explores the whole network. Thus, the claimed algorithm is much faster than the standard Google algorithm.
If a journey takes too long to complete, there are several possible reasons, and several measures the method can take. In some cases, the surfer can get caught in a relatively small area, let it be called E, of the network that is hard to leave. In these cases, one can compute the occupation ratios v(c) for each node c of E by dividing the number of visits to c by the total number of steps in E. Then, an effective rescaling of running time as described in [1] is possible: one has to use formula (4.6) in the cited reference, with the difference that one replaces the quantities nu_E(c) given there by the approximate quantities v(c). That formula gives an explicit set of alternative weights that can be used to jump from any point in E directly to one of the points on the boundary of E. It is sensible to store these alternative weights for future random walkers and to use them immediately if one of them enters or re-enters E. It is also possible (and sensible) to use this method recursively if necessary.
Another possible source of failure of the method is that the success rate is too small, meaning that most journeys starting in a also end in a. If this is the case, one has to use formula (4.5) of [1], where x is the starting point a, and again the nu_E(c) are replaced by the approximate occupation ratios v(c). Again, this change should be recorded for future walkers, and may need to be applied recursively.
Finally, it is possible that the walker runs for a long time without finding either a or b, even after rescaling. In this case, the method cancels the journey completely and does not count it towards the total number of journeys. The number of cancelled journeys should be recorded as well, as it can give a measure of accuracy for the resulting ranking: a high number of cancelled journeys suggests that the computed ranking may be of low quality.
In the following, some more variants and features of the model are given:
In the following, application scenarios and examples of the invention are given.
1) The interactive re-ranking based on user-defined criteria as mentioned above is handled as follows: It is assumed that a search query returned a subset A of nodes. It is also assumed that a criterion changing the strengths of the connection between arbitrary nodes of the network is given. This criterion may be specified by the user, or it may depend on the results of the search query. In addition to the examples given in the previous section, a possible change would be the following:
In Google's original algorithm, the ‘random surfer’ jumps to a completely arbitrary place in the world wide web with probability 0.15 in every step. A user or an external system could replace this by the prescription that the random surfer jumps to an arbitrary element of a given subset B of nodes. B can be the set A of results from the search query, can contain A but be larger than A, or can be entirely unrelated to A, e.g. all sites in a given language. This way, the web graph would depend on the search result. If B contains A, the chance of the surfer getting permanently lost, i.e. the number of cases where a journey is neither a success nor a failure after many steps, is greatly reduced.
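The modified teleport step can be sketched as follows. The toy network, the subset B and the seed are illustrative assumptions; the sketch only shows how restricting the random jump to B changes which parts of the network the surfer can reach.

```python
import random

def restricted_surfer_step(links, subset_B, current, rng, teleport=0.15):
    """One step of the modified surfer: with probability `teleport` it jumps
    to a random node of the subset B instead of anywhere in the network."""
    if rng.random() < teleport or not links[current]:
        return rng.choice(subset_B)       # random jump restricted to B
    return rng.choice(links[current])     # otherwise follow an outgoing link

# Toy network; 'd' receives no links from other nodes and is not in B.
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
B = ["a", "b"]  # e.g. the search-result set

rng = random.Random(1)
node = "a"
visited = set()
for _ in range(1000):
    node = restricted_surfer_step(links, B, node, rng)
    visited.add(node)

# The surfer can still reach 'c' through links, but it never teleports to 'd';
# since 'd' also has no incoming links, 'd' is never visited at all.
assert "d" not in visited
```

This illustrates the point made above: the restricted jump reshapes the walk (and hence the ranking) without cutting the surfer off from nodes that are well linked to B.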
The definition of the subset can be based on several technical parameters, such as language preferences set up on the computer, the geo-location, or the usage history stored on the computer.
In another context, a content blocker (e.g. parental control) can decide to not only block given sites, but also weaken connections to sites that are either forbidden or heavily linked to forbidden sites, so these become harder to find, and their ‘opinion’ counts less when ranking the allowed sites.
On a mobile device, certain search services (like restaurant search) may decide to strengthen links based on geo-location of the device, detected by sensors, or address data provided in the web site.
With the new set of weighted connections, each of the search results a has a new importance weight w(a). Items with relatively large w(a) should appear at the top of the results list. The weights w(a) cannot be pre-computed, as they depend on the search results or the user interaction. They could in principle be computed by running a Google algorithm on the full network, but this is too slow for real time. Instead, the fast comparison algorithm proposed by the invention computes some or all of the approximate reachability ratios R(a,b). It is then possible to compare the importance of a node a with the importance of another node b: if R(a,b) > 1, then a is more important than b. The final order in which the search results appear can now be determined by standard sorting algorithms, using that comparison.
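The use of the pairwise comparison inside a standard sorting algorithm can be sketched as follows. The ratio function passed in here is a stand-in for the journey-based estimator; the toy weights are illustrative.

```python
from functools import cmp_to_key

def rank_results(results, R):
    """Sort search results so that more important nodes come first.
    `R` is assumed to be a function returning the (approximate) importance
    ratio w(a)/w(b); R(a, b) > 1 means a is more important than b."""
    def compare(a, b):
        ratio = R(a, b)
        if ratio > 1:
            return -1   # a ranks above b
        if ratio < 1:
            return 1
        return 0
    return sorted(results, key=cmp_to_key(compare))

# Toy ratios standing in for the output of the journey-based estimator.
true_w = {"a": 0.5, "b": 0.3, "c": 0.2}
ranking = rank_results(["c", "a", "b"], lambda a, b: true_w[a] / true_w[b])
assert ranking == ["a", "b", "c"]
```

Because the approximate ratios are not necessarily transitive, a comparison-based sort may encounter conflicts; as noted further below, the quality measure can be used to resolve them.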
2) Fast maintenance of the ranking after small changes in the network structure (see b) above). It is assumed that the network has changed at a node a, in that there is a new connection from a to some other nodes. This will change the importance weight of a, but also of some of the neighboring nodes. To compute the change, the method of the invention starts random surfers from the node a and determines a set A of nodes where the change at a may have effects. One way to do this is to simply start many random surfers from a and let A+ be the set of nodes that a good number of them reach in the first hundred steps or so. A+ is the set that can be easily reached from the node a. The method of the invention should or can also run this process on the inverse network, where every connection of the original network points in the reverse direction. This gives a set A− of nodes from which it is easy to get to a. A will then be the union of A+ and A−. The method of the invention then calculates the ratios R(c,d) for all c, d in the set A, and determines the new weights by these ratios. One possible way of implementing this is to find nodes in A that are relatively weakly connected to a. A node b is weakly connected to a if there is a node c outside of A such that both the fractions J(b,c) and J(c,b) of successful journeys from b to c and back are much larger than the corresponding ratios from a to b. Such a node will only be weakly influenced by changing the network around a, and it can thus be assumed that the weight of b stays the same. The remaining weights of nodes in A can then again be computed by their ratios with w(b), and their mutual ratios offer a consistency check such as given in (ii) above. The new weights will then replace the old ones.
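The determination of the affected neighbourhood A = A+ ∪ A− can be sketched as follows. The surfer count, horizon and threshold are illustrative parameters standing in for "a good number of them" and "the first hundred steps or so"; the toy network is an assumption.

```python
import random

def reachable_set(links, start, surfers=200, horizon=100, threshold=0.05, seed=0):
    """Nodes that at least a `threshold` fraction of random surfers starting
    at `start` reach within `horizon` steps (a sketch of building A+)."""
    rng = random.Random(seed)
    counts = {n: 0 for n in links}
    for _ in range(surfers):
        current = start
        seen = set()
        for _ in range(horizon):
            if not links[current]:
                break
            current = rng.choice(links[current])
            seen.add(current)
        for n in seen:
            counts[n] += 1
    return {n for n, c in counts.items() if c / surfers >= threshold}

def reverse_links(links):
    """Invert every connection to obtain the inverse network."""
    rev = {n: [] for n in links}
    for x, outs in links.items():
        for y in outs:
            rev[y].append(x)
    return rev

links = {"a": ["b"], "b": ["c"], "c": ["b"], "d": ["a"], "e": []}
A_plus = reachable_set(links, "a")                   # easily reached from a
A_minus = reachable_set(reverse_links(links), "a")   # easily reaches a
A = A_plus | A_minus | {"a"}
assert "e" not in A  # 'e' is disconnected from a in both directions
```

The ratios R(c,d) would then be computed only for pairs inside A, leaving the rest of the network untouched.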
According to the claims, an important aspect of the invention is a method, executed by a processor of a computer system, to obtain a relative importance ranking of a subset of nodes of a larger, multiply linked set of nodes, wherein the nodes and links form a network, and wherein the links may have link weights attributed to them, wherein a link defines a reference to information stored on the nodes, so that each node provides information that is linked; the ranking is based on a structure of the links between the nodes, and is based on link weights that determine the importance of said links, comprising the following steps:
In a preferred embodiment the reachability score is computed in the following way:
According to the claims, an important aspect of the invention is a method, executed by a processor of a computer system, to obtain a relative importance ranking of a subset of nodes of a larger, multiply linked set of nodes, which are connected by a network, wherein a link defines a reference to information stored on the nodes, so that each node provides information that is linked; the ranking is based on a structure of the links between the nodes, and is based on link weights that determine the importance of said links. A possible embodiment is a computer network, wherein each computer provides a service on which linked data is stored. The computers or the services running on the computers represent the nodes.
The service can be an http service, also known as a web service, an ftp service, or any other service providing linked information. The links refer to external data on other nodes or to local data. A possible link is a hyperlink. The nodes can be addressed by URLs and paths. A subset of the nodes can be a search result derived from a search using a search engine, which provides the addresses as URLs. Databases can also be sources of the subset, storing the addresses of the nodes in relation to certain content or other relevant information.
The method comprises the following steps:
In a possible embodiment the step of performing a random walker method for each pair a, b of nodes is computed in parallel. The computation can be performed on several servers within the network, or on a local client or a mixture of both.
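The parallel computation of journeys can be sketched as follows. A thread pool is used here for simplicity; in a real deployment one would expect process pools or several servers, as described above. The toy weights, worker count and seeds are illustrative assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def journey_batch(weights, a, b, trials, seed):
    """Run a batch of journeys from a to b; return (successes, completed)."""
    rng = random.Random(seed)
    successes = completed = 0
    for _ in range(trials):
        current = a
        for _ in range(10_000):
            nbrs, ws = zip(*weights[current].items())
            current = rng.choices(nbrs, weights=ws)[0]
            if current == b:
                successes += 1
                completed += 1
                break
            if current == a:
                completed += 1
                break
    return successes, completed

def parallel_success_ratio(weights, a, b, workers=4, trials_per_worker=500):
    """Estimate J(a,b) from independent journey batches run in parallel;
    each worker gets its own seed so the batches are independent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(journey_batch, weights, a, b, trials_per_worker, s)
                   for s in range(workers)]
        results = [f.result() for f in futures]
    successes = sum(s for s, _ in results)
    completed = sum(c for _, c in results)
    return successes / completed

# Toy network: from a the surfer goes to b, and from b it reaches c or
# returns to a with equal probability, so J(a,c) is close to 0.5.
weights = {"a": {"b": 1.0}, "b": {"a": 0.5, "c": 0.5}, "c": {"a": 1.0}}
J_ac = parallel_success_ratio(weights, "a", "c")
```

Since journeys are mutually independent, the method parallelizes with essentially no coordination overhead, which is what makes the highly parallel server-side implementation attractive.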
The steps of the method can be calculated on a client computer: a server provides the necessary information about the network structure, and the client computer calculates the reachability score. This can be based on a copy of the network structure as mentioned above. It is also possible to perform the computation in the network itself if fast access is given.
In a possible embodiment, the prescription to compute link weights for a given node x is supplied by a user or by an external database.
In a possible embodiment the method comprises the step: Introducing a measure of quality into the ranking, defined by a record that stores how many journeys of the random walker have to be cancelled because neither a nor b is reached after a large predefined number of steps.
When the measure of quality is displayed to the user, the user is in a position to determine how much the provided ranking can be trusted. A high number of cancelled journeys is an indication that the quality is low, whereas when a large number of journeys have been successful, the quality is likely to be high. The measure of quality can also be used to determine the order of nodes when there is a conflict in the sorting algorithm, which can happen because the reachability scores are not necessarily transitive due to the approximate nature of the computation.
In another embodiment the method can be used to maintain the relative importance ranking of a multiply linked set of nodes, when some entries or links are added or removed, comprising the step:
In one embodiment of the invention, the user provides the system with a database query and additional parameters that can be used to modify the existing strengths of all the links in the network. The system will, in near real time, return the search results based on the query, ranked by the importance ranking based on the link strength parameters given by the user.
In a special case of the above embodiment, one can strengthen all the links between each pair of search results, and also restrict the completely random jumps that are sometimes done (those that do not follow any of the connections) to jumps between two search results. This is in a similar spirit to the existing LocalRank algorithm by Google, but is more powerful since it is not restricted to the search results alone. Also, the user can be allowed to choose the strength of the additional connections.
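The strengthening of links between search results can be sketched as follows. The boost factor, the toy weights and the re-normalisation step are illustrative assumptions; the sketch only shows how the modified weights favour connections inside the result set.

```python
def boost_result_links(weights, results, factor=2.0):
    """Return a copy of the link weights where every link between two search
    results is strengthened by `factor` (an illustrative choice); the weights
    leaving each node are re-normalised to sum to one."""
    results = set(results)
    new = {}
    for x, w in weights.items():
        boosted = {y: (v * factor if x in results and y in results else v)
                   for y, v in w.items()}
        total = sum(boosted.values())
        new[x] = {y: v / total for y, v in boosted.items()}
    return new

weights = {"a": {"b": 0.5, "c": 0.5}, "b": {"a": 1.0}, "c": {"a": 1.0}}
boosted = boost_result_links(weights, ["a", "b"])
# The a->b link (between two results) is now stronger than the a->c link.
assert boosted["a"]["b"] > boosted["a"]["c"]
```

The journey-based ranker then runs on these modified weights, so the ranking reflects the user-chosen strengthening without any pre-computation.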
In another embodiment, the user sends a search query, and the system returns a result ranked by a standard importance ranking. The user is then given the opportunity to change certain parameters, and the system changes the order of the search queries in real time, depending on the parameters.
In yet another embodiment, one or more nodes, and/or additional links, are added to the network. The aim is to compute the (standard) importance ranking of the new nodes, and possible impacts on the existing nodes, in real time, in order to keep the system up to date. For this, in a first step for each new node or node with a new link, a ‘neighbourhood’ of other nodes is determined, and then a relative importance ranking is obtained by the reachability method, giving a new importance score for new nodes, and an updated importance score for existing nodes.
One advantage of the reachability approach is that if given a pair of nodes, one can compute the relative importance of those two nodes based on only the network structure near those nodes. This makes it possible to obtain an importance ranking for relatively small subsets of the data base (such as results of search queries) in real time, whereas traditional methods need to rely on pre-computed importance rankings.
In the following, two implementations are shown where the feature of reachability with respect to near nodes can be very useful.
Examples would be that the user wants to strengthen links between web sites hosted by universities, or that the user decides to weaken links to web sites that receive many links (‘avoid crowded places’). In addition, it is possible that the new link weights make use of a standard set of link weights, a modification of link weights coming from the search query of the search results, or a combination thereof. These new link weights, along with the nodes that the user wants to be ranked (e.g. the results of their search query), are then given as an input to the ranker according to the invention. The ranker computes a ranking based on the data given by the user, and gives the ranked results to a presentation manager. As explained below, the ranker can not only rank search results, but also give an assessment of the quality of the relative rankings, so that the presentation manager can tell the user how much more important one database entry is compared to another, and how large the margin of error is in the given ranking. This may or may not be presented by the presentation manager, depending on the circumstances and the user preferences. Seeing the results, the user can then give feedback about the quality of the search results and change the weights accordingly. The system will update the ranking with the new weights. This procedure can continue until the user is satisfied with the ranking.
Two implementations are possible:
The ranker system can be run on the user's machine using a suitable computer program or browser applet. In this case, the database system will communicate the local network structure to the user's machine, which will then do the ranking. This implementation has the advantage that it is scalable, in the sense that it can be used by many users at the same time. Alternatively, the ranker system can run on a powerful and highly parallel architecture on the server side, which has the advantage that it can be very fast for a single user, since the method is very well suited for parallelization.
After applying the pair selector, one obtains a set of pairs. For each of these pairs, the relative importance ranking is computed as described in
If the set of scores is consistent and complete, it is returned as a result of the method. If it is not consistent and complete, a check is performed whether the run time is exceeded. If not, new pairs are added and/or the desired accuracy of the pair ranker is increased. If the run time is exceeded, the relative importance scores are returned, possibly along with warnings about those scores that are of low quality.