The present disclosure relates generally to methods and techniques for determining optimal policies for network monitoring, public surveillance or infrastructure security domains.
Recent years have seen a rise in interest in applying game-theoretic methods to real-world problems wherein one player (referred to as the leader) chooses a strategy to commit to (which may be a non-deterministic, i.e., mixed, strategy) and waits for the other player (referred to as the follower) to respond. Examples of such problems include network monitoring, public surveillance and infrastructure security domains, where the leader commits to a mixed, randomized patrolling strategy in an attempt to thwart the follower from compromising resources of high value to the leader. In particular, a known technique referred to as the ARMOR system, described in the reference to Pita, J., Jain, M., Western, C., Portway, C., Tambe, M., Ordonez, F., Kraus, S., Paruchuri, P. entitled Deployed ARMOR protection: The application of a game-theoretic model for security at the Los Angeles International Airport in Proceedings of AAMAS (Industry Track) (2008), suggests where to deploy security checkpoints to protect terminal approaches of Los Angeles International Airport. A further technique, described in a reference to Tsai, J., Rathi, S., Kiekintveld, C., Ordonez, F., Tambe, M. entitled IRIS - A tool for strategic security allocation in transportation networks in Proceedings of AAMAS (Industry Track) (2009), proposes flight routes for the Federal Air Marshals to protect domestic and international flights from being hijacked, and the PROTECT system (under development) suggests routes for the United States Coast Guard to survey critical infrastructure in the Boston harbor.
In arriving at optimal leader strategies for the above-mentioned and other domains, the leader's ability to profile the followers is of critical importance. In essence, determining the follower's preferences over its actions is a vital step in predicting the follower's rational response to leader actions, which in turn allows the leader to optimize the mixed strategy it commits to. In security domains in particular, it is very difficult to obtain precise and accurate information about the preferences and capabilities of possible attackers. For example, the follower might value the resources that the leader protects differently than the leader does, which leads to situations where some leader resources are at an elevated risk of being compromised. For example, a leader might value an airport fuel depot at $10M whereas the follower (without knowing that the depot is empty) might value the same depot at $20M. A fundamental problem that the leader thus has to address is how to act, over a prolonged period of time, given the initial lack of knowledge (or only a vague estimate) about the types of the followers and their preferences. Examples of such problems can be found in security applications for computer networks; see, for instance, a reference to Alpcan, T., Basar, T. entitled “A game theoretic approach to decision and analysis in network intrusion detection,” in Proceedings of the 42nd IEEE Conference on Decision and Control, pp. 2595-2600 (2003), and a reference to Nguyen, K. C., Basar, T. A. T. entitled “Security games with incomplete information,” in Proceedings of IEEE International Conference on Communications (ICC 2009) (2009), where the hackers are rarely caught and prevented from future attacks while their profiles are initially unknown.
Domains where the leader acts first by choosing a mixed strategy to commit to and the follower acts second by responding to the leader's strategy can be modeled as Stackelberg games.
In a Bayesian Stackelberg game the situation is more complex as the follower agent can be of multiple types (encountered with a given probability), and each type can have a different payoff matrix associated with it. The optimal strategy of the leader must therefore consider that the leader might end up playing the game with any opponent type. It has been shown that computing the Strong Bayesian Stackelberg Equilibrium is an NP-hard problem.
Formally, a Stackelberg game is defined as follows: Al={al
Given the follower type θ∈Θ, the expected utility of the leader strategy σ is therefore given by:
Given a probability distribution P(Θ) over the follower types, the expected utility of the leader strategy σ over all the follower types is hence:
Solving a single-round Bayesian Stackelberg game involves finding σ* = arg max_{σ∈Σ} U(σ).
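By way of illustration only, the single-round optimization may be sketched as follows. The sketch assumes two leader pure strategies, a brute-force search over a discretized grid of mixed strategies, and ties in the follower response broken toward the lowest action index; the payoff matrices and all function names are hypothetical and are not taken from the disclosure.

```python
from fractions import Fraction

def follower_best_response(sigma, follower_payoff):
    """Return the follower action index with the highest expected follower payoff."""
    def value(af):
        return sum(p * row[af] for p, row in zip(sigma, follower_payoff))
    # Assumption: ties go to the lowest action index (max keeps the first maximum).
    return max(range(len(follower_payoff[0])), key=value)

def leader_utility(sigma, leader_payoff, af):
    """Expected leader payoff of mixed strategy sigma when the follower plays af."""
    return sum(p * row[af] for p, row in zip(sigma, leader_payoff))

def expected_utility(sigma, leader_payoff, follower_payoffs, prior):
    """Leader utility of sigma averaged over follower types drawn from the prior."""
    return sum(
        weight * leader_utility(sigma, leader_payoff,
                                follower_best_response(sigma, payoff))
        for weight, payoff in zip(prior, follower_payoffs)
    )

def best_commitment(leader_payoff, follower_payoffs, prior, steps=10):
    """Search a grid of two-action mixed strategies for the best commitment."""
    grid = [(Fraction(k, steps), 1 - Fraction(k, steps)) for k in range(steps + 1)]
    return max(grid, key=lambda s: expected_utility(
        s, leader_payoff, follower_payoffs, prior))

# Hypothetical payoffs: rows are leader actions, columns are follower actions.
LEADER = [[2, 4], [1, 3]]
TYPE_A = [[1, 0], [0, 2]]
TYPE_B = [[0, 1], [1, 0]]
PRIOR = [Fraction(1, 2), Fraction(1, 2)]

sigma_star = best_commitment(LEADER, [TYPE_A, TYPE_B], PRIOR)
```

Exact rational arithmetic is used so that follower ties are detected reliably; a grid search of this kind only approximates the optimum over the full strategy simplex.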
In an example Stackelberg game 10 such as shown in
Despite recent progress on solving Bayesian Stackelberg games (games where the leader faces an opponent of different types, with different preferences), it is commonly assumed that the payoff structures (and thus also the preferences) of both players are known to the players (either as the payoff matrices or as the probability distributions over the payoffs).
It would be highly desirable to provide an approach to the problem of solving a repeated Stackelberg Game, played for a fixed number of rounds, where the payoffs or preferences of the follower and the prior probability distribution over follower types are initially unknown to the leader.
Rounds, Unknown Followers
In repeated Stackelberg games such as described in Letchford et al., entitled “Learning and Approximating the Optimal Strategy to Commit To,” in Proceedings of the Symposium on Algorithmic Game Theory, 2009, nature first selects a follower type θ∈Θ, upon which the leader then plays H rounds of a Stackelberg game against that follower. Across all rounds, the follower is assumed to act rationally (albeit myopically), whereas the leader aims to act strategically, so as to maximize the total utility collected in all H rounds of the game. The leader may never learn the exact type θ that it is playing against. Instead, the leader uses the observed follower responses to its actions to narrow down the subset of types and utility functions that are consistent with the observed responses.
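As a minimal sketch of this narrowing step, assume (purely for illustration) that the leader holds a finite set of candidate follower payoff matrices and discards every candidate whose myopic best response disagrees with an observed response. The candidate names, payoffs and function names below are hypothetical.

```python
def best_response(sigma, follower_payoff):
    """Myopic follower best response to the leader mixed strategy sigma."""
    def value(af):
        return sum(p * row[af] for p, row in zip(sigma, follower_payoff))
    return max(range(len(follower_payoff[0])), key=value)

def consistent_types(candidates, history):
    """Keep candidate types whose best responses match every observed round.

    candidates: dict mapping a type name to its follower payoff matrix.
    history: list of (sigma, observed_follower_action) pairs, one per round.
    """
    return {
        name for name, payoff in candidates.items()
        if all(best_response(sigma, payoff) == af for sigma, af in history)
    }

# Hypothetical candidate follower types (rows: leader actions, columns: follower actions).
CANDIDATES = {"t1": [[1, 0], [0, 2]], "t2": [[0, 1], [1, 0]]}

after_round_1 = consistent_types(CANDIDATES, [((0.6, 0.4), 1)])
after_round_2 = consistent_types(CANDIDATES, [((0.6, 0.4), 1), ((1.0, 0.0), 0)])
```

In this example the first observation is explained by both candidates, so nothing is learned, while the second probe separates them.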
To illustrate the concept of a repeated Stackelberg game with unknown follower preferences refer again to
Letchford et al. propose a method for learning the follower preferences in as few game rounds as possible; however, this technique is deficient in two respects. First, while the method ensures that the leader learns the complete follower preference structure (i.e., follower responses to any mixed strategy of the leader) in as few rounds as possible (by probing the follower responses with carefully chosen leader mixed strategies), it ignores the payoffs that the leader receives during these rounds. In essence, the leader values only exploration of the follower preferences and ignores the exploitation of the already known follower preferences for its own benefit. Second, the prior art method does not allow the follower to be of many types.
Further, existing work has predominantly focused on single-round games and, as such, has considered only the exploitation part of the problem. That is, such methods compute the optimal leader mixed strategy for just a single round of the game, given all the available information about the follower preferences and/or payoffs. In contrast, while the work by Letchford et al. considers a repeated-game scenario, it does not consider that the leader would optimize its own payoffs. Instead, that work presumes that the leader acts so as to uniquely determine the follower preferences in the fewest number of rounds, which may be arbitrarily expensive for the leader. In addition, the technique proposed by Letchford et al. considers only non-Bayesian Stackelberg games, in that the authors assume that the follower is of a single type.
There is provided a system, method and computer program product for solving a repeated Stackelberg Game, played for a fixed number of rounds, where the payoffs or preferences of the follower and the prior probability distribution over follower types are initially unknown to the leader.
Accordingly, there is provided a system, method and computer program product for planning actions in repeated Stackelberg games with unknown opponents, in which a prior probability distribution over preferences of the opponents is available, the method comprising: running, in a simulator including a programmed processor unit, a plurality of simulation trials from a root node specifying the initial state of a repeated Stackelberg game, that results in an outcome in the form of a utility to the leader, wherein one or more simulation trials comprises one or more rounds comprising: selecting, by the leader, a mixed strategy to play in the current round; determining at a current round, a response of the opponent, of type fixed at the beginning of a trial according to the prior probability distribution, to the leader strategy selected; computing a utility of the leader strategy given the opponent response in the current round; updating an estimate of expected utility for the leader action at this round; and, recommending, based on the estimated expected utility of leader actions at the root node, an action to perform in the initial state of a repeated Stackelberg game, wherein a computing system including at least one processor and at least one memory device connected to the processor performs the running and the recommending.
Further to this aspect, the simulation trials are run according to a Monte Carlo Tree Search method.
Further, according to the method, at the one or more rounds, the method further comprises inferring opponent preferences given observed opponent responsive actions in prior rounds up to the current round.
Further, according to the method, the inferring further comprises: computing opponent best response sets and opponent best response anti-sets, said opponent best response set being a convex set including leader mixed strategies for which the leader has observed or inferred that the opponent will respond by executing an action, and said best response anti-sets each being a convex set that includes leader mixed strategies for which the leader has inferred that the follower will not respond by executing an action.
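The inference may be illustrated, under simplifying assumptions, for a leader with only two pure strategies, so that a mixed strategy is a single probability p and every convex best response set is an interval. The sketch further assumes that the follower responds consistently (no inconsistent tie-breaking); the function name and sample data are hypothetical.

```python
def possible_responses(p, observations, actions):
    """Follower actions that can still be the best response to strategy p.

    p: probability of the leader's first pure strategy (one-dimensional simplex).
    observations: list of (q, follower_action) pairs seen in earlier rounds.
    actions: all follower actions.
    """
    # Best response set of each observed action: the interval spanned by its points.
    hull = {}
    for q, af in observations:
        lo, hi = hull.get(af, (q, q))
        hull[af] = (min(lo, q), max(hi, q))

    # If p lies inside a known best response set, the response is already inferred.
    for af, (lo, hi) in hull.items():
        if lo <= p <= hi:
            return {af}

    # Otherwise an action is ruled out at p (p is in its anti-set) when stretching
    # its interval to p would swallow a point where another action was observed.
    possible = set()
    for af in actions:
        lo, hi = hull.get(af, (p, p))
        lo, hi = min(lo, p), max(hi, p)
        if not any(lo <= q <= hi and other != af for q, other in observations):
            possible.add(af)
    return possible

OBS = [(0.2, "x"), (0.4, "x"), (0.7, "y")]
ACTIONS = ["x", "y", "z"]
```

For instance, with these observations the response at p = 0.3 is inferred to be "x" without playing that strategy, whereas at p = 0.9 action "x" is excluded because the point p = 0.7, where "y" was observed, lies between 0.9 and the known set of "x".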
Further, in one embodiment, the processor device is further configured to perform pruning of leader strategies satisfying one or more of: suboptimal expected payoff in the current round, and a suboptimal expected sum of payoffs in subsequent rounds.
Further, the leader actions are selected from among a finite set of leader mixed strategies, wherein said finite set comprises leader mixed strategies whose pure strategy probabilities are integer multiples of a discretization interval.
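One possible way to enumerate such a finite set is sketched below, assuming the discretization interval is 1/steps for an integer steps; the function name and parameters are hypothetical.

```python
from fractions import Fraction

def discretized_strategies(num_actions, steps):
    """All mixed strategies whose probabilities are integer multiples of 1/steps."""
    def compositions(total, parts):
        # Yield every way to split `total` units across `parts` pure strategies.
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    return [
        tuple(Fraction(k, steps) for k in counts)
        for counts in compositions(steps, num_actions)
    ]

# Three leader pure strategies with a discretization interval of 1/4.
strategies = discretized_strategies(3, 4)
```

The number of strategies grows combinatorially with the number of pure strategies and with the fineness of the interval, which is one reason restricting the candidate set can matter in practice.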
Further, in one embodiment, the estimate of an expected utility of a leader action includes a benefit of information gain about an opponent response to said leader action combined with an immediate payoff for the leader for executing said leader action.
Further, in one embodiment, the updating the estimate of expected utility for the leader action at the current round comprises: averaging the utilities of the leader action at the current round, across multiple trials that share the same history of leader actions and follower responses up to the current round.
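A minimal sketch of such an averaging update follows, assuming estimates are stored in a table keyed by the shared history and the leader action, and updated incrementally after each trial. The class and method names are hypothetical.

```python
from collections import defaultdict

class UtilityEstimates:
    """Running mean of leader utility per (history, leader action) pair."""

    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def update(self, history, action, utility):
        """Fold one trial's utility into the mean for this history and action."""
        key = (tuple(history), action)
        self.count[key] += 1
        # Incremental mean: avoids storing the utilities of individual trials.
        self.mean[key] += (utility - self.mean[key]) / self.count[key]

    def estimate(self, history, action):
        """Current estimate, or None if this pair has never been sampled."""
        return self.mean.get((tuple(history), action))

estimates = UtilityEstimates()
root = []                      # empty history: the root node
estimates.update(root, "sigma_1", 2.0)
estimates.update(root, "sigma_1", 4.0)
# A different history (leader played sigma_1, follower answered action 0).
estimates.update([("sigma_1", 0)], "sigma_2", 1.0)
```

Trials that diverge in any earlier leader action or follower response map to different keys and therefore do not contaminate each other's averages.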
A computer program product is provided for performing operations. The computer program product includes a storage medium readable by a processing circuit and storing instructions run by the processing circuit for running a method. The method is the same as listed above.
The objects, features and advantages of the present invention will become apparent to one skilled in the art, in view of the following detailed description taken in combination with the attached drawings, in which:
In one aspect, there is formulated a Stackelberg game problem, and in particular, a Multi-round Stackelberg game having 1) Unknown adversary types; and, 2) Unknown adversary payoffs (e.g., follower preferences). A system, method and computer program product provides a solution for exploring the unknown adversary payoffs or exploiting the available knowledge about the adversary to optimize the leader strategy across multiple rounds.
In one embodiment, the method optimizes the expected cumulative reward-to-go of the leader who faces an opponent of possibly many types and unknown preference structures.
In one aspect, the method employs the Monte Carlo Tree Search (MCTS) sampling technique to estimate the utility of leader actions (its mixed strategies) in any round of the game. The utility is understood as comprising the benefit of information gain about the best follower response to a given leader action combined with the immediate payoff for the leader for executing the leader action. In addition, to improve the efficiency of MCTS as applied to the problem at hand, the method further determines which leader actions, albeit applicable, should not be considered by the MCTS sampling technique.
One key innovation of MCTS is to incorporate node evaluations within traditional tree search techniques that are based on stochastic simulations (i.e., “rollouts” or “playouts”), while also using bandit-sampling algorithms to focus the bulk of simulations on the most promising branches of the tree search. This combination appears to have overcome traditional exponential scaling limits to established planning techniques in a number of large-scale domains.
Standard implementations of MCTS maintain and incrementally grow a collection of nodes, usually organized in a tree structure, representing possible states that could be encountered in the given domain. The nodes maintain counts nsa of the number of simulated trials in which action a was selected in state s, as well as mean reward statistics
One implementation of MCTS makes use of the UCT algorithm (e.g., as described in L. Kocsis and C. Szepesvari entitled “Bandit based Monte-Carlo Planning” in 15th European Conference on Machine Learning, pages 282-293, 2006), which employs a tree-search policy based on a variant of the UCB1 bandit-sampling algorithm (e.g., as described in the reference “Finite-time Analysis of the Multiarmed Bandit Problem” by P. Auer, et al. from Machine Learning 47:235-256, 2002). The policy computes an upper confidence bound Bsa for each possible action a in a given state s according to: Bsa=
As further shown in
Leader strategies in each round of each trial are selected by MCTS using either the UCB1 tree-search policy for the initial rounds within the tree, or a playout policy for the remaining rounds taking place outside the tree. One playout policy uses uniform random selection of leader mixed strategies for each remaining round of the playout. The MCTS tree is grown incrementally with each trial, starting from just the root node at the first trial. Whenever a new leader mixed strategy is tried from a given node, the set of all possible transition nodes (i.e., the leader mixed strategy followed by all possible follower responses) is added to the tree representation.
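The in-tree selection step may be sketched as follows using the standard UCB1 rule (mean reward plus an exploration bonus). The exploration constant c and the exact form of the bonus are assumptions based on the published UCB1 algorithm and may differ from the expression used in a given embodiment; the function name is hypothetical.

```python
import math

def ucb1_select(stats, c=math.sqrt(2)):
    """Pick the leader action with the highest upper confidence bound.

    stats: dict mapping each leader action to (visit_count, mean_reward).
    c: exploration constant (assumed value; tune per domain).
    """
    # Try every action once before comparing confidence bounds.
    for action, (visits, _) in stats.items():
        if visits == 0:
            return action

    total_visits = sum(visits for visits, _ in stats.values())

    def bound(action):
        visits, mean_reward = stats[action]
        return mean_reward + c * math.sqrt(math.log(total_visits) / visits)

    return max(stats, key=bound)
```

For example, an action sampled only once keeps a large bonus and is preferred over a slightly better but heavily sampled action, which is how the policy balances exploration against exploitation.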
In one aspect, as shown in
To improve the efficiency of the MCTS employed, some embodiments of the method also determine which leader actions, albeit applicable, should not be considered by the MCTS sampling technique.
Pruning of Leader's Strategies
In some cases, the leader's exploration of the complete reward structure of the follower is unnecessary. In essence, in any round of the game, the leader can identify unsampled leader mixed strategies whose immediate expected value for the leader is guaranteed not to exceed the expected value of leader strategies employed by the leader in the earlier rounds of the game. If the leader then just wants to maximize the expected payoff of its next action, these not-yet-employed strategies can safely be disregarded (i.e., pruned).
As indicated at step 110,
Where Ū(θ, σ) is the upper bound on the expected utility of the leader playing σ, established from the leader observations B(θ, σ′); σ′∈E(n) as follows:
Where Af(σ)⊂Af is a set of follower actions af that can still (given B(θ, σ′); σ′∈E(n)) constitute the follower best response to σ while U(σ, af) is the expected utility of the leader mixed strategy σ if the follower responds to it by executing action af. That is:
Thus, in order to determine whether a not-yet-employed strategy σ should be executed, the method includes determining the elements of a best response set Af(σ) given B(θ, σ′); σ′∈E(n).
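The resulting pruning test may be sketched as follows, assuming the set Af(σ) of still-possible follower responses is supplied by a separate routine and that the leader only seeks to maximize the expected payoff of its next action. The payoff matrix and all names are hypothetical.

```python
def leader_utility(sigma, leader_payoff, af):
    """U(sigma, af): expected leader payoff if the follower responds with af."""
    return sum(p * row[af] for p, row in zip(sigma, leader_payoff))

def upper_bound(sigma, leader_payoff, possible_afs):
    """Optimistic leader utility over the follower actions still possible at sigma."""
    return max(leader_utility(sigma, leader_payoff, af) for af in possible_afs)

def prune(candidates, leader_payoff, possible, best_so_far):
    """Drop strategies whose upper bound cannot exceed the best value seen so far.

    possible: callable returning the still-possible follower actions for a strategy.
    best_so_far: best expected leader value among strategies already employed.
    """
    return [
        sigma for sigma in candidates
        if upper_bound(sigma, leader_payoff, possible(sigma)) > best_so_far
    ]

# Hypothetical leader payoffs (rows: leader actions, columns: follower actions).
LEADER_PAYOFF = [[2, 4], [1, 3]]

def still_possible(sigma):
    # Assumed inference result: follower action 0 is certain once sigma[0] > 0.7.
    return {0} if sigma[0] > 0.7 else {0, 1}

kept = prune([(0.5, 0.5), (1.0, 0.0)], LEADER_PAYOFF, still_possible, best_so_far=3.0)
```

Here the pure strategy (1.0, 0.0) is pruned because even its most favorable possible follower response yields less than the value already achieved.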
To find the actions that can still constitute the best response of the follower of type θ to a given leader strategy σ, there is first defined the concept of Best Response Sets and Best Response Anti-Sets.
For each action af∈Af of the follower, there is first defined a best response set Σa
For each action af∈Af of the follower, there is second defined a best response anti-set
It is proved by contradiction a first proposition (“Proposition 1”) that each best response set Σa
Finding the follower best response(s) is now illustrated by an example such as shown in
Notice, how in this example context it is not profitable for the leader to employ a mixed strategy σ such that σ(al
Ū(θ,σ)=max{U(σ,af
Hence, while employing strategy σ would allow the leader to learn B(θ, σ) (i.e., to disambiguate in
Thus, considering one MCTS trial, that is, one complete H-round game utilizing a fixed follower type, as shown in the
The example in
Finally, the example in
The method in one embodiment provides a fully automated procedure for determining these leader strategies that can be safely eliminated from the MCTS action space in a given node, for a given MCTS trial.
When an MCTS trial starts (at the root node), the follower type is initially unknown, hence the leader does not know any follower best response sets Σa
At a basic level, the pruning method maintains convex best response sets Σa
The pruning method runs independently of MCTS and can be applied to any node whose parent has already been serviced by the pruning method. There is provided to the programmed computer system including a processor device and memory storage system, data maintained at such node corresponding to a situation where the rounds 1, 2, . . . , k−1 of the game have already been played. At 302, there is input the set of leader strategies that have not yet been pruned denoted as Σ(k-1)⊂Σ (and not to be confused with the set E(k-1) of leader strategies employed in rounds 1, 2, . . . , k−1 of the game). There is Σ(0)=Σ at the root node. Also, at 302 there is assigned Σa
In
Similarly, in another embodiment, there are constructed two anti-sets implied by set 370 and two anti-sets implied by set 365. However, as the leader is repeatedly playing a Bayesian Stackelberg game with a rational opponent, the leader can probe the opponent in order to learn its preferences. Thus, selective probing (i.e., sampling a leader action) and observing the responses allows the leader to make deductions regarding opponent strategies, e.g., by adding a point to the simplex space, and, according to the pruning method of
In one non-limiting example implementation of the pruning method depicted in
Thus, the present technique may be deployed in real domains that may be characterized as Bayesian Stackelberg games, including, but not limited to, security and monitoring deployed at airports, randomization in scheduling of the Federal Air Marshal Service, and other security applications.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a system, apparatus, or device running an instruction.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a system, apparatus, or device running an instruction. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may run entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which run via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which run on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more operable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be run substantially concurrently, or the blocks may sometimes be run in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While there has been shown and described what is considered to be preferred embodiments of the invention, it will, of course, be understood that various modifications and changes in form or detail could readily be made without departing from the spirit of the invention. It is therefore intended that the scope of the invention not be limited to the exact forms described and illustrated, but should be construed to cover all modifications that may fall within the scope of the appended claims.