Embodiments of this invention relate to multi-agent systems, in particular to policy evaluation in such systems.
Many real-world applications involve multi-agent system modelling, ranging from selecting the strongest artificial intelligence robot to play video games against humans, to developing a population of safe self-driving cars that will never collide with one another.
These tasks usually involve different types of interactions among the multiple agents involved. With multi-agent learning techniques becoming increasingly complicated, how to efficiently evaluate the effectiveness of each individual agent's policy, or the joint policy as a whole, in such multi-agent environments becomes a critical problem.
ELO Rating, as described in Arpad E. Elo, ‘The rating of chess players, past and present’, Arco Pub., 1978, is a common approach to policy evaluation that is widely used in ranking players in competitive games including football, chess and Go. Players with a higher ELO rating have a higher probability of winning a game than players with a lower ELO rating. If a player with a higher ELO rating wins, only a few points are transferred from the lower-rated player, whereas if a lower-rated player wins, the number of points transferred from the higher-rated player is far greater. However, a disadvantage of the ELO rating method is that it cannot deal with intransitive behaviours in a game, for example the cyclical best responses in a game of Rock-Paper-Scissors. In that game, a Rock-playing agent will attain a high ELO score if more Scissors-playing agents are introduced into a tournament, while the ground truth is that Rock, Paper and Scissors should be ranked equally, regardless of how often each is played.
The Nash equilibrium, as described in David Balduzzi, Karl Tuyls, Julien Perolat, and Thore Graepel, ‘Re-evaluating evaluation’, Advances in Neural Information Processing Systems, pages 3268-3279, 2018, and John F. Nash et al., ‘Equilibrium points in n-person games’, Proceedings of the National Academy of Sciences, 36(1), 48-49, 1950, is a classical solution concept in Game Theory for strategy evaluation. However, targeting the Nash equilibrium is problematic in several ways. For example, it is not guaranteed that a dynamical game will converge to a Nash equilibrium. It is also known that solving for a Nash equilibrium in general-sum games is PPAD-complete (complete for the class of Polynomial Parity Arguments on Directed graphs), which seriously limits the complexity of the multi-agent systems that can be evaluated. It is also unclear how to select one equilibrium among the many Nash equilibria that a game may have.
Replicator dynamics from Evolutionary Game Theory, as described in Peter Schuster and Karl Sigmund, ‘Replicator dynamics’, Journal of Theoretical Biology, 100(3): 533-538, 1983, are often deployed to capture the micro-dynamics of interacting agents in a multi-agent system. These approaches focus on studying the basins of attraction and the equilibria of the evaluated agents' strategies. However, they are limited in that they can only feasibly be applied to games involving a small number of agents. Furthermore, this approach cannot deal with games with asymmetric payoffs.
Alpha-Rank, as described in Shayegan Omidshafiei, Christos Papadimitriou, Georgios Piliouras, Karl Tuyls, Mark Rowland, Jean-Baptiste Lespiau, Wojciech M Czarnecki, Marc Lanctot, Julien Perolat, and Remi Munos, ‘Alpha-Rank: Multi-agent evaluation by evolution’, Nature Scientific Reports, 2019, enables an evolutionary analysis of multi-agent interactions for asymmetric games by introducing Markov-Conley Chains. However, this approach runs in polynomial time and space with respect to the total number of pure strategy profiles, which prohibits its usage for real-world applications. When designing a self-driving car solution for a community of ten cars, each car having only two strategies to choose from, there are 2^10 joint strategy profiles, and the time complexity of finding the best joint strategy using Alpha-Rank would be of the order of (2^10)^3 = 2^30 operations, which is intolerable. This method therefore cannot be scaled to tackle multi-agent problems with large population sizes.
It is desirable to model the inherently dynamical behaviors of a multi-agent policy evaluation problem, in both simulation engines and the real-world environment, in a way that can be scaled up to tackle large-scale policy evaluation problems.
According to one aspect there is provided a computer-implemented policy evaluation system for identifying an optimal combination of operational policies for implementation by a plurality of policy-driven actors, each actor being capable of adopting any of a plurality of operational policies, the policy evaluation system being configured to iteratively perform the following steps: (i) selecting a first combination of operational policies, the first combination defining a policy for each actor; (ii) receiving a vector of values, each value representing the benefit of a respective second combination where all but one of the actors adopts the policy defined for it in the first combination and that actor adopts a different policy; and (iii) estimating a ranking of combinations of policies for the actors in dependence on those values and a previously estimated ranking of combinations of policies for the actors.
The said set of values may be a row or column of the transition matrix of the joint-policy profile for all the actors. This can permit the system to iteratively converge on a ranking of all possible combinations of the policies that is optimum to within a predetermined threshold, with no need to list all those possible combinations.
The iteratively performed steps may comprise: (iv) assessing whether the ranking of combinations of policies for the actors has converged to a predetermined degree; and, if that assessment is positive, assigning to each of the actors the policy defined for it in the highest-ranked combination in the ranking. Once the system has iteratively converged on an adequate solution, that solution can advantageously be adopted by each actor implementing the respective policy indicated for it in the highest-ranked combination of the solution.
The step of estimating the ranking of combinations of policies for the actors may comprise estimating an adjustment from the previously estimated ranking of combinations of policies for the actors and applying that adjustment to the previously estimated ranking of combinations of policies for the actors. This can assist the system in reaching the eventual solution in an iterative way, potentially improving efficiency.
The system may be configured to compute each of the said values as:

C_ij = η·(1 − exp(−α[M^(k)(s_j^k, s_i^−k) − M^(k)(s_i^k, s_i^−k)])) / (1 − exp(−α·m·[M^(k)(s_j^k, s_i^−k) − M^(k)(s_i^k, s_i^−k)])) if M^(k)(s_j^k, s_i^−k) ≠ M^(k)(s_i^k, s_i^−k), and C_ij = η/m otherwise,

and:

η = (Σ_{l=1...N} (|S^l| − 1))^−1

where:
α is a ranking intensity
k is an index of the first agent
M^(k)(s_i^k, s_i^−k) is the fitness/reward for agent k using policy s_i^k while the other agents use the joint policy s_i^−k
m is a hyper-parameter representing the population size
s_i is the i-th joint combination of policies
S^l is the set of policies available to agent l.
The system may be configured to estimate the ranking of combinations of policies for the actors as:
where:
b is the said vector of values
1 is a vector of ones
x is the ranking to be output
η is a learning rate
ε is an accuracy parameter
The system may be configured to store the said set of values in a compressed form in which are specified: (i) the quantity of the most common one of the said values and (ii) the index and quantity of each of the said values that is not equal to the most common one of the said values. This can reduce memory consumption.
The quantity of each of the said values that is not equal to the most common one of the said values may be specified as a difference between that quantity and the quantity of the most common one of the said values. This can reduce memory consumption.
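By way of illustration only, the following Python sketch shows one way such a compressed representation might be implemented; the class name CompressedRow and its fields are illustrative assumptions rather than features of the claimed system.

```python
import numpy as np

class CompressedRow:
    """Stores a row of values as (i) the most common value and (ii) the index of
    each differing entry together with its difference from the most common value."""

    def __init__(self, row: np.ndarray):
        values, counts = np.unique(row, return_counts=True)
        self.common_value = values[np.argmax(counts)]   # the most common value
        mask = row != self.common_value
        self.indices = np.nonzero(mask)[0]              # indices of the exceptions
        self.deltas = row[mask] - self.common_value     # stored as differences
        self.length = row.size

    def to_dense(self) -> np.ndarray:
        row = np.full(self.length, self.common_value)
        row[self.indices] += self.deltas
        return row

# Example: a row in which most entries share the same value.
row = np.array([0.1, 0.1, 0.7, 0.1])
assert np.allclose(CompressedRow(row).to_dense(), row)
```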
In each iteration it may be that the only values in dependence on which the preferred combination of policies for the actors is estimated are the sets of values each representing a benefit of a respective second combination where all but one of the actors adopts the policy defined for it in the first combination and that actor adopts a different policy. This can simplify the computation process.
In each iteration it may be that the only values in dependence on which the preferred combination of policies for the actors is estimated are the values of the respective row or column of the transition matrix of the joint policy profile for the actors. This can simplify the computation process.
The actors may be policy-driven actors. The actors may be at least partially autonomous vehicles and the policies may be driving policies. M may indicate a representation of the fitness for a combination of joint driving policies. The actors may be capable of interacting with each other in implementing the policies. In this way, the highest ranked combination can result in relatively efficient operation of the actors as a group.
The step of selecting an actor may be performed stochastically. This can allow for an optimal combination to be found efficiently.
According to a second aspect there is provided a computer-implemented method for identifying an optimal combination of operational policies for implementation by a plurality of actors, each actor being capable of adopting any of a plurality of operational policies, the method comprising iteratively performing the following steps: (i) selecting a first combination of operational policies, the first combination defining a policy for each actor; (ii) receiving a vector of values, each value representing the benefit of a respective second combination where all but one of the actors adopts the policy defined for it in the first combination and that actor adopts a different policy; and (iii) estimating a ranking of combinations of policies for the actors in dependence on those values and a previously estimated ranking of combinations of policies for the actors.
According to a third aspect there is provided a data carrier storing in non-transient form a set of instructions for causing a computer to perform the method set out in the preceding paragraph.
The present invention will now be described by way of example with reference to the accompanying drawings.
In the drawings:
Embodiments of the present invention concern an evolution-principled method of evaluating and ranking agents' policies in a large-scale multi-agent setting. The method produces a ranking of the agents' joint policies that are favoured by the evolutionary dynamics, filtering out transient strategies that would go extinct over long-term evolutionary interaction.
It is assumed that there is a community of agents {1, . . . , N} and that each agent i has a collection of strategies/policies S^i = {s_i^1, . . . , s_i^{K_i}}.
In general, the total number of strategies K_i can vary between agents; without loss of generality, for ease of notation it is assumed here that K_i = K_j for all i, j ∈ {1, . . . , N}.
Each agent has a payoff function M^(i): Π_{i=1}^N S^i → ℝ^+. In other words, the payoff function of each agent i depends not only on the chosen strategy for agent i but also on the strategies of all other agents.
Assuming that each agent i chooses a strategy s_i^{q_i} ∈ S^i, the joint strategy profile is denoted s = (s_1^{q_1}, . . . , s_N^{q_N}), and the corresponding payoffs are denoted M(s) = (M^(1)(s), . . . , M^(N)(s)). (1) The payoff of agent i may equivalently be written M^(i)(s_i^{q_i}, s^−i), where s^−i = (s_1^{q_1}, . . . , s_{i−1}^{q_{i−1}}, s_{i+1}^{q_{i+1}}, . . . , s_N^{q_N}) denotes the strategies of all agents other than agent i.
The goal is to provide a ranking algorithm that can offer rankings over the set of all possible joint strategy profiles under evaluation and provide insights into these strategies, including their strengths, their weaknesses, and their dynamics of survival in the evolutionary sense.
For each agent i, a population is created comprising m copies of agent i. The game of policy evaluation can then essentially be represented as a Markov Chain, where the state space is given as the collection of all joint strategy profiles, i.e. S_MC = Π_{i=1}^N S^i, and the transition probability matrix C ∈ ℝ_+^{|S_MC|×|S_MC|} assigns non-zero off-diagonal probability only to pairs of joint strategy profiles s_i, s_j that differ in the strategy of a single agent k, namely:

C_ij = η·(1 − exp(−α[M^(k)(s_j^k, s_i^−k) − M^(k)(s_i^k, s_i^−k)])) / (1 − exp(−α·m·[M^(k)(s_j^k, s_i^−k) − M^(k)(s_i^k, s_i^−k)])) if M^(k)(s_j^k, s_i^−k) ≠ M^(k)(s_i^k, s_i^−k), and C_ij = η/m otherwise, (2)

where η = (Σ_{l=1}^{N} (|S^l| − 1))^−1, C_ij = 0 for profiles differing in more than one agent's strategy, and C_ii = 1 − Σ_{j≠i} C_ij. M^(k)(s_i^k, s_i^−k) is the fitness/reward for agent k playing strategy s_i^k while the other agents play s_i^−k. In Equation (2), α is defined to be the ranking intensity, a hyper-parameter of the algorithm.
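For illustration, a minimal Python sketch of how a single row of a transition matrix of the form of Equation (2) might be computed on demand is given below. The function name transition_row, the payoff callable and the data layout are assumptions made for the sake of example; this is a sketch of the general construction rather than a definitive implementation of the described system.

```python
import math
from typing import Callable, Dict, Tuple

Profile = Tuple[int, ...]  # one strategy index per agent

def transition_row(profile: Profile,
                   num_strategies: Tuple[int, ...],
                   payoff: Callable[[int, Profile], float],
                   alpha: float,
                   m: int) -> Dict[Profile, float]:
    """Non-zero entries of the transition-matrix row for `profile`: only joint
    profiles differing in a single agent's strategy are reachable (Equation (2))."""
    eta = 1.0 / sum(K - 1 for K in num_strategies)   # normalisation over single-agent mutations
    row: Dict[Profile, float] = {}
    for k, K in enumerate(num_strategies):
        f_current = payoff(k, profile)               # fitness of agent k in the current profile
        for new_strategy in range(K):
            if new_strategy == profile[k]:
                continue
            neighbour = profile[:k] + (new_strategy,) + profile[k + 1:]
            f_new = payoff(k, neighbour)             # fitness if agent k switches strategy
            if math.isclose(f_new, f_current):
                row[neighbour] = eta / m
            else:
                diff = f_new - f_current
                row[neighbour] = eta * (1.0 - math.exp(-alpha * diff)) \
                                     / (1.0 - math.exp(-alpha * m * diff))
    row[profile] = 1.0 - sum(row.values())           # self-transition keeps the row stochastic
    return row

# Example: two agents with two policies each and an arbitrary illustrative payoff table.
payoffs = {(0, 0): (1.0, 1.0), (0, 1): (0.0, 2.0), (1, 0): (2.0, 0.0), (1, 1): (0.5, 0.5)}
print(transition_row((0, 0), (2, 2), lambda k, s: payoffs[s][k], alpha=1.0, m=50))
```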
The constructed Markov Chain is irreducible and ergodic, hence there exists a limit distribution over the set of joint strategy profiles π ∈ ℝ_+^{|S_MC|} such that

lim_{t→∞} v_0·C^t = π, (3)

which does not depend on the initial distribution v_0. π is also a stationary distribution, i.e. π^T·C = π^T.
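As a small numerical check (not part of the described system), the sketch below verifies on a toy three-state chain that repeated application of a row-stochastic matrix converges to a distribution π satisfying π^T·C = π^T, independently of the initial distribution v_0; the matrix values are arbitrary.

```python
import numpy as np

# A small row-stochastic transition matrix over three joint strategy profiles.
C = np.array([[0.80, 0.10, 0.10],
              [0.20, 0.60, 0.20],
              [0.25, 0.25, 0.50]])

v = np.array([1.0, 0.0, 0.0])   # arbitrary initial distribution v_0
for _ in range(1000):           # v_0 C^t approaches the limit distribution pi
    v = v @ C

assert np.allclose(v @ C, v)    # pi is stationary: pi^T C = pi^T
print(v)
```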
The stationary distribution π shows how much time, on average, the Markov Chain spends in each state. Hence, in order to pick the most preferable joint strategy profiles, s* is defined such that:

s* ∈ arg max_{s∈S_MC} π(s). (4)
Moreover, the stationary distribution π allows for the ranking of each individual strategy s_i^q of each agent i as follows. For each strategy s_i^q ∈ S^i the value of the following is computed:

rank(s_i^q) = Σ_{s^−i} π(s_i^q, s^−i). (5)
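A minimal sketch of the marginalisation in Equation (5) follows, assuming the stationary distribution is available as a mapping from joint strategy profiles to probabilities; the function name strategy_rank_value is an illustrative assumption.

```python
from typing import Dict, Tuple

Profile = Tuple[int, ...]

def strategy_rank_value(pi: Dict[Profile, float], agent: int, strategy: int) -> float:
    """Equation (5): sum the stationary mass of every joint profile in which
    `agent` plays `strategy`, marginalising over the other agents' strategies."""
    return sum(p for profile, p in pi.items() if profile[agent] == strategy)

# Example with two agents and two strategies each (probabilities sum to 1).
pi = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.2}
print(strategy_rank_value(pi, agent=0, strategy=0))  # 0.4 + 0.1 = 0.5
```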
Based on this value, all of the strategies s_i^q ∈ S^i for agent i can be sorted. Computing the stationary distribution π of the transition probability matrix C can be cast as a convex constrained optimisation problem:

minimise (1/2)·||C^T·x − x||_2^2 subject to x^T·1 = 1 and x ≥ 0, (6)

where 1 denotes the vector of ones.
As illustrated generally in the accompanying drawings, the constrained problem of Equation (6) is then relaxed into an augmented objective, Equation (7), in which the following relaxation of the equality constraint is utilised:
x^T·1 − 1 + ε ≥ 0 and 1 + ε − x^T·1 ≥ 0, (8)
for some accuracy parameter ε > 0. In Equation (7), α is a weighting parameter.
Iterative updates are then performed. During the iterative updates, both parameters ε and α are gradually reduced.
The above problem can be further cast as a finite-sum minimisation problem. Let b_i denote the i-th row of the matrix C^T − I; then (1/2)·||(C^T − I)·x||_2^2 = (1/2)·Σ_i (b_i^T·x)^2, so the objective decomposes into a sum of terms, each of which depends on only a single row of C^T − I.
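As an illustration of this finite-sum structure, the Python sketch below minimises (1/2)·Σ_i (b_i^T·x)^2 by sampling a single row b_i per iteration. The simple clip-and-renormalise step is an assumption standing in for the relaxed constraints of Equations (7) and (8), so this is not the exact update rule of the described method.

```python
import numpy as np

def stationary_by_row_sampling(row, n, iters=20_000, lr=0.5, seed=0):
    """Minimise (1/2) * sum_i (b_i^T x)^2, where b_i = row(i) is the i-th row of
    C^T - I, touching only one sampled row per iteration."""
    rng = np.random.default_rng(seed)
    x = np.full(n, 1.0 / n)                  # start from the uniform distribution
    for _ in range(iters):
        b = row(rng.integers(n))             # access a single sampled row
        x -= lr * (b @ x) * b                # gradient of (1/2)(b^T x)^2 is (b^T x) b
        x = np.clip(x, 1e-12, None)
        x /= x.sum()                         # keep x a probability vector
    return x

# Tiny usage example with an explicit 3-by-3 transition matrix.
C = np.array([[0.80, 0.10, 0.10],
              [0.20, 0.60, 0.20],
              [0.25, 0.25, 0.50]])
B = C.T - np.eye(3)
pi = stationary_by_row_sampling(lambda i: B[i], n=3)
print(pi, pi @ C)                            # the two vectors should approximately coincide
```

On this toy example the scheme recovers the stationary distribution while accessing only one row of C^T − I per iteration, which is the property exploited at scale.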
As shown at 103 in the accompanying drawings, the iterative updates then proceed as follows.
Therefore, a first combination of policies is selected, the first combination defining a policy for each actor (for example, each car in the population of self-driving cars). A vector of values is received, each representing the benefit of a respective second combination of operational policies where all but one of the actors adopts the policy defined for it in the first combination and that actor adopts a different policy. A ranking of combinations of policies for the actors is estimated in dependence on those values and a previously estimated ranking of combinations of policies for the actors. The ranking is iteratively updated by estimating an adjustment from the previously estimated ranking of combinations of policies and that adjustment is applied to the previously estimated ranking of combinations of policies.
Whereas in previous approaches the whole transition matrix needs to be pre-computed and stored, in the present approach the said set of values is a row or column of the transition matrix of the joint-policy profile, i.e. only one row or column of the matrix is required in each iteration.
In a preferred implementation, in each iteration, the only values in dependence on which the preferred combination of policies for the actors is estimated are the sets of values each representing a benefit of a respective second combination where all but one of the actors adopts the policy defined for it in the first combination and that actor adopts a different policy. The algorithm therefore need only consider joint strategy profiles that differ from the current profile in a single agent's policy. The outcome of the method is the optimal stationary distribution π*, i.e. the preferred ranking of the joint strategies, as illustrated generally at 104 in the accompanying drawings.
The whole evolutionary policy evaluation approach is summarised by Algorithm 2, shown in the accompanying drawings.
Each agent's policy may be augmented and learned during the game, rather than being fixed at the beginning of the game. The whole transition matrix does not need to be pre-computed and stored: the method can use a single row or column of the transition matrix in each iteration. This can permit the system to iteratively converge on an optimum ranking of possible combinations of the policies.
When it is detected that the ranking of combinations of policies for the actors has converged to a predetermined degree, i.e. π = π*, each of the actors can be assigned the policy defined for it in the highest-ranked combination in the ranking.
A detailed system diagram of how the proposed method may be applied is shown in the accompanying drawings.
The general framework can deal with the policy evaluation problem for any multi-agent system. The method may therefore be deployed on any accessible multi-agent environment, or an equivalent simulation engine, that can take the agents' joint strategies as input and output each agent's fitness/reward value. With such an environment provided, the ranking of the joint strategy profiles can be obtained by following the computation flow of the system diagram shown in the accompanying drawings.
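As a hedged illustration, the interface assumed of such an environment or simulation engine might look as follows in Python; the protocol name EvaluationEnvironment and the method signature are assumptions made for the sake of example.

```python
from typing import Protocol, Sequence, Tuple

Profile = Tuple[int, ...]   # one chosen policy index per agent

class EvaluationEnvironment(Protocol):
    """Anything that maps a joint strategy profile to per-agent fitness/reward values."""

    def fitness(self, profile: Profile) -> Sequence[float]:
        """Return one fitness/reward value per agent for the given joint profile."""
        ...

def payoff(env: EvaluationEnvironment, agent: int, profile: Profile) -> float:
    """Adapter returning the per-agent payoff used when constructing transition rows."""
    return env.fitness(profile)[agent]
```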
The policy evaluation method may therefore result in a significant reduction in both the time and the space required to compute the stationary distribution π*, which is essentially the required ranking. In particular, the method described herein does not need to pre-compute and store the whole transition matrix, which could be prohibitively large. By adopting a stochastic optimization method with an augmented Lagrangian, only one column of the transition matrix is accessed at each iteration. The number of iterations needed is also much smaller than for traditional linear algebra methods or power methods. This method is therefore directly applicable to policy evaluation for large-scale multi-agent systems.
As mentioned previously, in one example, the actors are self-driving cars and the policies are driving policies. In this example, M in Equation (2) indicates a representation of the fitness for a combination of joint driving policies. The cars are capable of interacting with each other in implementing the policies. In this way, the highest ranked combination of policies can result in relatively efficient and safe operation of the cars as a group.
The results of comparative experiments are shown in the accompanying drawings.
In summary, described herein is an evolutionary dynamics methodology for evaluating agents' policies in large-scale multi-agent interactions. The method is able to model the inherently dynamical behaviours of the multi-agent policy evaluation problem, for example for a population of self-driving cars. Unlike traditional methods, which do not scale well when the total number of joint strategy profiles is large, the proposed algorithm can be scaled up to tackle large-scale multi-agent policy evaluation problems.
The method uses stochastic sampling of the transition matrix to fully leverage the sparsity exploited by the ranking algorithm. It can therefore be applied to large populations of agents, resulting in less computational time and more efficient memory usage. Implementations of the present invention may achieve speeds around 1000 times faster, and significantly lower memory consumption, compared with traditional numerical libraries for Python such as NumPy and SciPy, and with traditional methods such as the power method used in PageRank.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
This application is a continuation of International Application No. PCT/EP2019/073406, filed on Sep. 3, 2019, the disclosure of which is hereby incorporated by reference in its entirety.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/EP2019/073406 | Sep 2019 | US
Child | 17686113 | | US