The present disclosure generally relates to the field of computer processing, and in particular to reinforcement learning architectures in machine learning.
Reinforcement learning can be used to train and deploy computerized agents (hereinafter simply “agents”) in trading markets; however, such applications carry fundamental challenges such as high variance and costly exploration. Moreover, markets are inherently a multi-agent domain, with many actors taking actions and changing the environment. To tackle these types of scenarios, agents need to exhibit certain characteristics such as risk-awareness, robustness to perturbations and low learning variance.
According to an aspect, there is provided a computer-implemented system for training an automated agent. The system includes a communication interface; at least one processor; memory in communication with the at least one processor; software code stored in the memory, which when executed at the at least one processor causes the system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receive, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initialize a learning table Q for the automated agent based on the plurality of states and the plurality of actions; compute a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generate an averaged learning table Q′ based on the plurality of updated learning tables.
In some embodiments, the automated agent is configured to select an action based on the averaged learning table Q′ for communicating one or more task requests.
In some embodiments, the utility function is represented by u(x)=−eβx, where β<0.
In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table {circumflex over (Q)} based on the initialized learning table Q; selecting an action at from the plurality of actions based on the interim learning table {circumflex over (Q)} and a given state st from the plurality of states; computing a reward rt and a next state st+1 based on the selected action at; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Qi of the plurality of updated learning tables based on (st,at, rt, st+1) and the utility function.
In some embodiments, the averaged learning table Q′ is computed as an element-wise average of the k updated learning tables Qi, i=1, 2, . . . , k.
In some embodiments, the utility function is a first utility function and the software code, when executed at the at least one processor, further causes the system to: instantiate an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initialize an adversarial learning table QA for the adversarial agent; compute a plurality of updated adversarial learning tables based on the initialized adversarial learning table QA using a second utility function, the second utility function comprising a monotonically increasing convex function; and generate an averaged adversarial learning table QA′ based on the plurality of updated adversarial learning tables.
In some embodiments, the adversarial agent is configured to select an adversarial action based on the averaged adversarial learning table QA′ to minimize a reward for the automated agent.
In some embodiments, the second utility function is represented by uA(x)=−eβ
In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table {circumflex over (Q)}A based on the initialized adversarial learning table QA; selecting an adversarial action atA based on the interim adversarial learning table {circumflex over (Q)}A and a given state st from the plurality of states; computing an adversarial reward rtA and a next state st+1 based on the selected adversarial action atA; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table QiA of the plurality of updated adversarial learning tables based on (st,atA,rtA,st+1) and the second utility function.
In some embodiments, the averaged adversarial learning table QA′ is computed as an element-wise average of the k updated adversarial learning tables QiA, i=1, 2, . . . , k.
According to another aspect, there is provided a computer-implemented method of training an automated agent, the method including: instantiating an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receiving, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initializing a learning table Q for the automated agent based on the plurality of states and the plurality of actions; computing a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generating an averaged learning table Q′ based on the plurality of updated learning tables.
In some embodiments, the method may further include: selecting an action, by the automated agent, based on the averaged learning table Q′ for communicating one or more task requests.
In some embodiments, the utility function is represented by u(x)=−eβx, where β<0.
In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table {circumflex over (Q)} based on the initialized learning table Q; selecting an action at from the plurality of actions based on the interim learning table {circumflex over (Q)} and a given state st from the plurality of states; computing a reward rt and a next state st+1 based on the selected action at; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Qi of the plurality of updated learning tables based on (st,at, rt, st+1) and the utility function.
In some embodiments, the averaged learning table Q′ is computed as an element-wise average of the k updated learning tables Qi, i=1, 2, . . . , k.
In some embodiments, the utility function is a first utility function and the method may further include: instantiating an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initializing an adversarial learning table QA for the adversarial agent; computing a plurality of updated adversarial learning tables based on the initialized adversarial learning table QA using a second utility function, the second utility function comprising a monotonically increasing convex function; and generating an averaged adversarial learning table QA′ based on the plurality of updated adversarial learning tables.
In some embodiments, the method may further include selecting an adversarial action by the adversarial agent based on the averaged adversarial learning table QA′ to minimize a reward for the automated agent.
In some embodiments, the second utility function is represented by uA (x)=−eβ
In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table {circumflex over (Q)}A based on the initialized adversarial learning table QA; selecting an adversarial action atA based on the interim adversarial learning table {circumflex over (Q)}A and a given state st from the plurality of states; computing an adversarial reward rtA and a next state st+1 based on the selected adversarial action atA; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table QiA of the plurality of updated adversarial learning tables based on (st,atA,rtA,st+1) and the second utility function.
In some embodiments, the averaged adversarial learning table QA′ is computed as an element-wise average of the k updated adversarial learning tables QiA, i=1, 2, . . . , k.
Other features will become apparent from the drawings in conjunction with the following description.
In the figures, embodiments are illustrated by way of example. It is to be expressly understood that the description and figures are only for the purpose of illustration and as an aid to understanding.
Embodiments will now be described, by way of example only, with reference to the attached figures, wherein in the figures:
Reinforcement learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences, in order to maximize a reward. RL has been applied to a number of fields such as games [5], navigation [4], software engineering [2], industrial design [22], and finance [18]. Each of these applications has inherent difficulties which are long-standing fundamental challenges in RL, such as limited training time, costly exploration and safety considerations, among others.
In particular, in finance, there are some examples of RL in stochastic control problems such as option pricing [19], market making [29], and optimal execution [24]. An example finance application is trading, where the system is configured to implement trained algorithms capable of automatically making trading decisions based on a set of stored rules computed by a machine [31].
In trading, the environment represents the market (and the rest of the actors). The agent's task is to take actions related to how and how much to trade, and the objective is usually to maximize profit while minimizing risk. There are several challenges in this setting such as partial observability, a large action space, a difficult definition of rewards and learning objectives [31]. In this disclosure, two properties for learning agents in realistic trading market scenarios are considered and implemented: risk assessment and robustness. In some embodiments, one or more RL algorithms are implemented with risk-averse objective functions and variance reduction techniques. In some embodiments, the RL algorithms are implemented to operate in a multi-agent learning environment, and can assume an adversary which may take over and perturb the learning process. These RL algorithms are developed to balance theoretical guarantees with practical use. Additionally, an empirical game theory analysis for multi-agent learning by considering risk-averse payoffs is performed and discussed herein.
Risk assessment is a cornerstone in financial applications. One approach is to consider risk while assessing the performance (profit) of a trading strategy. Here, risk is a quantity related to the variance (or standard deviation) of the profit and is commonly referred to as “volatility”. In particular, the Sharpe ratio [27] considers both the generated profit and the risk (variance) associated with a trading strategy. This objective function (the Sharpe ratio) is different from traditional RL, where the goal is to optimize the expected return, usually without consideration of risk. There are works that proposed risk-sensitive RL algorithms [21, 12] and variance reduction techniques [1]. The RL algorithms discussed in this disclosure improve upon these works by further reducing variance, for example, through the combination of using a utility function in updating multiple Q-tables in a Q-learning environment, while also having convergence guarantees and improved robustness via adversarial learning.
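For reference, the computation underlying this measure can be sketched as follows; the zero risk-free rate, the annualization factor of 252 trading days, and the function name are illustrative assumptions rather than values taken from this disclosure.

```python
import numpy as np

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Sharpe ratio of a series of per-period returns.

    Assumes a constant risk-free rate; the annualization factor is
    illustrative only.
    """
    excess = np.asarray(returns, dtype=float) - risk_free_rate
    std = excess.std(ddof=1)
    if std == 0:
        return 0.0
    return np.sqrt(periods_per_year) * excess.mean() / std

# Example: a short series of daily returns of a trading strategy.
print(sharpe_ratio([0.001, -0.002, 0.0015, 0.0005, -0.001]))
```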
Deep RL has been shown to be brittle in many scenarios [15]; therefore, improving robustness is important for deploying agents in realistic scenarios, such as for use in trading platforms. A line of work has improved the robustness of RL agents via adversarial perturbations [23, 26]. For example, the learning framework or system may assume an adversary (who is also learning) is allowed to take over control at regular intervals. This approach has shown good experimental results in robotics [25].
A trading market can be seen as a multi-agent interaction environment. Therefore, the agents in the RL algorithms may be evaluated from the perspective of game theory. However, it may be too difficult to analyze in a standard game-theoretic framework since there is no normal form representation (commonly used to analyze games). Fortunately, empirical game theory [35, 38] overcomes this limitation by using the information of several rounds of repeated interactions and assuming a higher level of strategies (agents' policies). These modifications have made possible the analysis of multi-agent interactions in complex scenarios such as markets [7], and multi-agent games [33]. However, these works have not studied the interactions under risk metrics (such as the Sharpe ratio), which are explored in this disclosure.
In summary, the RL algorithms disclosed, in some embodiments, combine risk-awareness, variance reduction and robustness techniques. For example, a Risk-Averse Averaged Q-Learning (e.g., RA2-Q shown in
A computer system is described next in which the various RL algorithms may be implemented to train one or more automated agents.
As detailed herein, in some embodiments, system 100 includes features adapting it to perform certain specialized purposes, e.g., to function as a trading platform. In such embodiments, system 100 may be referred to as trading platform 100 or simply as platform 100 for convenience. In such embodiments, the automated agent may generate requests for tasks to be performed in relation to securities (e.g., stocks, bonds, options or other negotiable financial instruments). For example, the automated agent may generate requests to trade (e.g., buy and/or sell) securities by way of a trading venue.
Referring now to the embodiment depicted in
A processor 104 is configured to execute machine-executable instructions to train a reinforcement learning network 110 through a training engine 112. The training engine can be configured to generate signals based on one or more rewards or incentives to train automated agents 180 to perform desired tasks more optimally, e.g., to minimize or maximize certain performance metrics such as risk or variance.
The platform 100 can connect to an interface application 130 installed on user device to receive input data. Trade entities 150a, 150b can interact with the platform to receive output data and provide input data. The trade entities 150a, 150b can have at least one computing device. The platform 100 can train one or more reinforcement learning neural networks 110. The trained reinforcement learning networks 110 can be used by platform 100 or can be for transmission to trade entities 150a, 150b, in some embodiments. The platform 100 can process trade orders using the reinforcement learning network 110 in response to commands from trade entities 150a, 150b, in some embodiments.
The platform 100 can connect to different data sources 160 and databases 170 to receive input data and receive output data for storage. The input data can represent trade orders. Network 140 (or multiple networks) is capable of carrying data and can involve wired connections, wireless connections, or a combination thereof. Network 140 may involve different network communication technologies, standards and protocols, for example.
The platform 100 can include an I/O unit 102, a processor 104, communication interface 106, and data storage 120. The I/O unit 102 can enable the platform 100 to interconnect with one or more input devices, such as a keyboard, mouse, camera, touch screen and a microphone, and/or with one or more output devices such as a display screen and a speaker.
The processor 104 can execute instructions in memory 108 to implement aspects of processes described herein. The processor 104 can execute instructions in memory 108 to configure a data collection unit, interface unit (to provide control commands to interface application 130), reinforcement learning network 110, training engine 112, and other functions described herein. The processor 104 can be, for example, any type of general-purpose microprocessor or microcontroller, a digital signal processing (DSP) processor, an integrated circuit, a field programmable gate array (FPGA), a reconfigurable processor, or any combination thereof.
As depicted in
Reinforcement learning is a category of machine learning that configures agents, such as the automated agents 180 described herein, to take actions in an environment so as to maximize a notion of reward; in some embodiments, the reward can be maximized by minimizing risks or variances. The processor 104 is configured with machine executable instructions to instantiate an automated agent 180 that maintains a reinforcement learning neural network 110 (also referred to as a reinforcement learning network 110 for convenience), and to train the reinforcement learning network 110 of the automated agent 180 using a training engine 112. The processor 104 is configured to control the reinforcement learning network 110 to process input data in order to generate output signals. Input data may include trade orders, various feedback data (e.g., rewards), feature selection data, data reflective of completed tasks (e.g., executed trades), data reflective of trading schedules, etc. Output signals may include signals for communicating resource task requests, e.g., a request to trade in a certain security. For convenience, a good signal may be referred to as a “positive reward” or simply as a reward, and a bad signal may be referred to as a “negative reward” or as a “punishment”.
Referring again to
Memory 108 may include a suitable combination of any type of computer memory that is located either internally or externally such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical memory, magneto-optical memory, erasable programmable read-only memory (EPROM), and electrically-erasable programmable read-only memory (EEPROM), Ferroelectric RAM (FRAM) or the like. Data storage devices 120 can include memory 108, databases 122, and persistent storage 124.
The communication interface 106 can enable the platform 100 to communicate with other components, to exchange data with other components, to access and connect to network resources, to serve applications, and perform other computing applications by connecting to a network (or multiple networks) capable of carrying data including the Internet, Ethernet, plain old telephone service (POTS) line, public switch telephone network (PSTN), integrated services digital network (ISDN), digital subscriber line (DSL), coaxial cable, fiber optics, satellite, mobile, wireless (e.g. Wi-Fi, WiMAX), SS7 signaling network, fixed line, local area network, wide area network, and others, including any combination of these.
The platform 100 can be operable to register and authenticate users (using a login, unique identifier, and password for example) prior to providing access to applications, a local network, network resources, other networks and network security devices. The platform 100 may serve multiple users which may operate trade entities 150a, 150b.
The data storage 120 may be configured to store information associated with or created by the components in memory 108 and may also include machine executable instructions. The data storage 120 includes a persistent storage 124 which may involve various types of storage technologies, such as solid state drives, hard disk drives, flash memory, and may be stored in various formats, such as relational databases, non-relational databases, flat files, spreadsheets, extended markup files, etc.
As shown in
In some embodiments, once the reinforcement learning network 110 has been trained, it generates output signal 188 reflective of its decisions to take particular actions in response to input data. Input data can include, for example, a set of data obtained from one or more data sources 160, which may be stored in databases 170 in real time or near real time.
As a practical example, consider an HVAC control system configured to set and control heating, ventilation, and air conditioning (HVAC) units for a building. In order to efficiently manage the power consumption of the HVAC units, the control system may receive sensor data representative of temperature data in a historical period. The control system may be implemented to use an automated agent 180 and a trained reinforcement learning network 110 to generate an output signal 188, which may be a resource request command signal 188 indicative of a set value or set point representing an optimal room temperature based on the sensor data, which may be part of input data 185 representative of the temperature data at present and in a historical period (e.g., the past 72 hours or the past week).
The input data 185 may include time series data that is gathered from sensors 160 placed at various points of the building. The measurements from the sensors 160, which form the time series data, may be discrete in nature. For example, the time series data may include a first data value of 21.5 degrees representing the detected room temperature in Celsius at time t1, a second data value of 23.3 degrees representing the detected room temperature in Celsius at time t2, a third data value of 23.6 degrees representing the detected room temperature in Celsius at time t3, and so on.
Other input data 185 may include a target range of temperature values for the particular room or space and/or a target room temperature or a target energy consumption per hour. A reward may be generated based on the target room temperature range or value, and/or the target energy consumption per hour.
In some examples, one or more automated agents 180 may be implemented, each agent 180 for controlling the room temperature for a separate room or space within the building which the HVAC control system is monitoring.
As another example, in some embodiments, a traffic control system may be configured to set and control traffic flow at an intersection. The traffic control system may receive sensor data representative of detected traffic flows at various points of time in a historical period. The traffic control system may use an automated agent 180 and a trained reinforcement learning network 110 to control a traffic light based on input data representative of the traffic flow data in real time, and/or traffic data in the historical period (e.g., the past 4 or 24 hours).
The input data 185 may include sensor data gathered from one or more data sources 160 (e.g., sensors 160) placed at one or more points close to the traffic intersection. For example, the time series data may include a first data value of 3 vehicles representing the detected number of cars at time t1, a second data value of 1 vehicle representing the detected number of cars at time t2, a third data value of 5 vehicles representing the detected number of cars at time t3, and so on.
Based on a desired traffic flow value at tn, the automated agent 180, based on neural network 110, may then generate an output signal 188 to shorten or lengthen a red or green light signal at the intersection, in order to ensure the intersection is least likely to be congested during one or more points in time.
As yet another example, the input data 185 may include a set of measured blood pressure values or blood sugar levels in a time period measured by one or more data sources such as medical devices 160. The trained reinforcement learning network 110 may receive the input data 185 from the sensors 160 or a database 170, and generate an output signal 188 representing a predicted data value, such as a future blood pressure value or a future blood sugar level. The output signal 188 representing the predicted data value may be transmitted to a health care professional for monitoring or medical purposes.
In some embodiments, as another example, an automated agent 180 in system 100 may be trained to play a video game, and more specifically, a lunar lander game 300, as shown in
In some embodiments, the reward may indicate a plurality of objectives including: smoothness of landing, conservation of fuel, time used to land, and distance to a target area on the landing pad. The reward, which may be a reward vector, can be used to train the neural network 110 for landing the lunar lander by the automated agent 180.
A Markov Decision Process (MDP) is defined by a set of states S describing the possible configurations, a set of actions A and a set of observations O for each agent. A stochastic policy πθ: S×A→[0,1] parameterized by θ produces the next state according to the state transition function T: S×A→S. The agent obtains rewards as a function of the state and agent's action r: S×A→ℝ, and receives a private observation correlated with the state o: S→O. The initial states are determined by a distribution d0: S→[0,1].
In RL, each agent i aims to maximize its own total expected return, e.g., for a Markov game with two agents, for a given initial state distribution d0, the discounted returns are respectively:
J1(d0,π1,π2)=Σt=0∞γt𝔼[rt1|π1,π2,d0] (1)
J2(d0,π1,π2)=Σt=0∞γt𝔼[rt2|π1,π2,d0] (2)
where γ is a discount factor, rt1, rt2, t=1, 2, . . . are respectively immediate rewards for agent 1 and agent 2. A Nash equilibrium for Markov game (with two agents) is defined below.
Definition 1 [16] A Nash equilibrium point of game (J1, J2) is a pair of strategies (π*1, π*2) such that for all s ∈ S,
J1(s,π*1,π*2)≥J1(s,π1,π*2) ∀π1 (3)
J2(s,π*1,π*2)≥J2(s,π*1,π2) ∀π2 (4)
Multi-Agent Extension of MDP
A Markov game for N agents is defined by a set of states S describing the possible configurations of all agents, a set of actions A1, . . . , AN and a set of observations O1, . . . , ON for each agent. To choose actions, each agent i uses a stochastic policy πθi
Q-learning can use a Q-table to guide an agent to find the best action. A Q-table can be generated based on the [state, action] pairs available to the agent, and updated with appropriate values after an action is taken by the agent during a training step or episode. This Q-table acts as a reference table for the agent to select the optimal action based on each value in the table. In multi-agent Q-learning, the Q-tables are defined over joint actions for each of the agents. Each agent receives rewards according to its reward function, with transitions dependent on the actions chosen jointly by the set of agents.
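A minimal single-agent tabular Q-learning update is sketched below to make the role of the Q-table concrete; the learning rate, discount factor, ϵ, and the toy table sizes are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Standard temporal-difference update for one (s, a, r, s') transition."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Q-table over 10 states and 4 actions, initialized to zero.
rng = np.random.default_rng(0)
Q = np.zeros((10, 4))
a = epsilon_greedy(Q, 0, 0.1, rng)
Q = q_update(Q, s=0, a=a, r=1.0, s_next=1)
```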
In some embodiments, the multi-agent behaviours in a trading market can be analyzed using empirical game theory, where a player corresponds to an agent, and a strategy corresponds to a learning algorithm. Then, in a p-player game, players are involved in a single round strategic interaction. Each player i can be configured to select a strategy πi from a set of k strategies Si={π1i, . . . , πki} and receive a stochastic payoff Ri(π1, . . . , πp), where Ri: S1×S2× . . . ×Sp→ℝ. The underlying game that is usually studied is ri(π1, . . . , πp)=𝔼[Ri(π1, . . . , πp)]. In general, the payoff of player i can be denoted as pit, and the joint strategy of all players except for player i can be denoted as x−i.
Definition 2 A joint strategy x=(x1, . . . , xp)=(xi, x−i) is a Nash equilibrium if for all i: ri(xi,x−i)≥ri(x′i,x−i) for every alternative strategy x′i of player i.
Definition 3 A joint strategy x=(x1, . . . , xp)=(xi, x−i) is an ϵ-Nash equilibrium if for all i: ri(xi,x−i)≥ri(x′i,x−i)−ϵ for every alternative strategy x′i of player i.
Evolutionary dynamics can be used to analyze multi-agent interactions. An example model is replicator dynamics (RD) [36], which describes how a population evolves through time under evolutionary pressure (in the present disclosure, a population is composed of learning algorithms). RD assumes that reproductive success is determined by interactions and their outcomes. For example, the population of a certain type increases if that type has a higher fitness (in the present disclosure, this means the expected return in a certain interaction) than the population average; otherwise, that population share will decrease.
To view the dominance of different strategies, it is common to plot the directional field of the payoff tables using the replicator dynamics for a number of strategy profiles x in the simplex strategy space [33].
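For reference, one common single-population form of the replicator dynamics is reproduced below; the payoff matrix A over the k strategies is an assumed abstraction of the meta-game payoffs rather than a formula taken from this disclosure.

\[
\dot{x}_i = x_i\left[(Ax)_i - x^{\top} A x\right], \qquad i = 1,\dots,k,
\]

where x lies on the strategy simplex, (Ax)_i is the expected fitness of strategy i against the population mix x, and x^{\top}Ax is the population-average fitness; the share of strategy i grows exactly when its fitness exceeds the average, matching the description above.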
The embodiments of RL algorithms shown in
Wainwright in [34] proposed a variance reduction Q-learning algorithm (V-QL) which can be seen as a variant of the SVRG algorithm in stochastic optimization [17]. Given an algorithm that converges to Q*, one of its iterates
In some embodiments, risk-averse objective functions [21] can be combined with the Q-learning algorithm to reduce variance and risk, as elaborated below.
Shen in [28] proposed a Q-learning algorithm that is shown to converge to the optimal of a risk-sensitive objective function. In [28], the training scheme is the same as Q-learning, except that in each iteration, a utility function is applied to a temporal difference (TD) error (see e.g., Algorithm 5 in
In order to optimize the expected return as well as minimize the variance of the expected return, an expected utility of the return can be used as the objective function instead:
By a straightforward Taylor expansion, Eq.(7) above yields:
where when β<0 the objective function is risk-averse, when β=0 the objective function is risk-neutral, and when β>0 the objective function is risk-seeking.
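To make the risk interpretation explicit, a short derivation under the exponential utility is sketched below; the second-order truncation of the expansion is an assumption made for illustration.

\[
u(x) = -e^{\beta x}, \qquad
\frac{1}{\beta}\log \mathbb{E}\!\left[e^{\beta R}\right]
\;\approx\; \mathbb{E}[R] + \frac{\beta}{2}\,\mathrm{Var}[R],
\]

so maximizing \(\mathbb{E}[u(R)] = -\mathbb{E}[e^{\beta R}]\) is approximately equivalent to maximizing \(\mathbb{E}[R] + \tfrac{\beta}{2}\mathrm{Var}[R]\): a negative β penalizes variance (risk-averse), β=0 recovers the risk-neutral expected return, and a positive β rewards variance (risk-seeking).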
By applying a monotonically increasing concave utility function u(x)=−exp(βx) where β<0 to the TD error, Algorithm 5 (see e.g.,
Theorem 1 (Theorem 3.2, [28]) Running Algorithm 5 from an initial Q table, Q→Q* with probability (w.p.) 1, where Q* is the unique solution to
∀(s, a), where s′ is sampled from [·|s,a], and the corresponding policy π* of Q* satisfies {tilde over (J)}π*≥{tilde over (J)}π ∀π.
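As a concrete illustration of this style of update (a utility applied to the TD error, with x0=−1 as stated elsewhere in this disclosure), a minimal sketch is given below; the values of β, the learning rate, and the discount factor are illustrative assumptions.

```python
import numpy as np

def risk_averse_q_update(Q, s, a, r, s_next, beta=-0.5, alpha=0.1,
                         gamma=0.99, x0=-1.0):
    """Q-learning step with a concave utility applied to the TD error.

    A concave increasing utility u(x) = -exp(beta * x), beta < 0, weighs
    negative TD errors more heavily than positive ones of the same
    magnitude, which steers the agent away from high-variance actions.
    """
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    utility = -np.exp(beta * td_error)
    Q[s, a] += alpha * (utility - x0)  # x0 = u(0) = -1 recenters the step
    return Q
```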
[16] proposed Nash-Q, a Multi-Agent Q-learning algorithm (e.g., Algorithm 6 in
In some example embodiments, a computer-implemented system for training an automated agent may include: a communication interface; at least one processor; memory in communication with the at least one processor; software code stored in the memory, which when executed at the at least one processor causes the system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receive, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initialize a learning table Q for the automated agent based on the plurality of states and the plurality of actions; compute a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generate an averaged learning table Q′ based on the plurality of updated learning tables.
In some embodiments, the automated agent is configured to select an action based on the averaged learning table Q′ for communicating one or more task requests.
In some embodiments, the utility function is represented by u(x)=−eβx, where β<0.
In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table {circumflex over (Q)} based on the initialized learning table Q; selecting an action at from the plurality of actions based on the interim learning table {circumflex over (Q)} and a given state st from the plurality of states; computing a reward rt and a next state st+1 based on the selected action at; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Qi of the plurality of updated learning tables based on (st,at, rt, st+1) and the utility function.
In some embodiments, the averaged learning table Q′ is computed as an element-wise average of the k updated learning tables Qi, i=1, 2, . . . , k.
In some embodiments, the utility function is a first utility function and the software code, when executed at the at least one processor, further causes the system to: instantiate an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initialize an adversarial learning table QA for the adversarial agent; compute a plurality of updated adversarial learning tables based on the initialized adversarial learning table QA using a second utility function, the second utility function comprising a monotonically increasing convex function; and generate an averaged adversarial learning table QA′ based on the plurality of updated adversarial learning tables.
In some embodiments, the adversarial agent is configured to select an adversarial action based on the averaged adversarial learning table QA′ to minimize a reward for the automated agent.
In some embodiments, the second utility function is represented by uA(x)=−eβ
In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table {circumflex over (Q)}A based on the initialized adversarial learning table QA; selecting an adversarial action atA based on the interim adversarial learning table {circumflex over (Q)}A and a given state st from the plurality of states; computing an adversarial reward rtA and a next state st+1 based on the selected adversarial action atA; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table QiA of the plurality of updated adversarial learning tables based on (st,atA,rtA,st+1) and the second utility function.
In some embodiments, the averaged adversarial learning table QA′ is computed as an element-wise average of the k updated adversarial learning tables QiA, i=1, 2, . . . , k.
In some example embodiments, there is a computer-implemented method of training an automated agent, the method may include: instantiating an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receiving, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initializing a learning table Q for the automated agent based on the plurality of states and the plurality of actions; computing a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generating an averaged learning table Q′ based on the plurality of updated learning tables.
In some embodiments, the method may further include: selecting an action, by the automated agent, based on the averaged learning table Q′ for communicating one or more task requests.
In some embodiments, the utility function is represented by u(x)=−eβx, β<0.
In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table {circumflex over (Q)} based on the initialized learning table Q; selecting an action at from the plurality of actions based on the interim learning table {circumflex over (Q)} and a given state st from the plurality of states; computing a reward rt and a next state st+1 based on the selected action at; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Qi of the plurality of updated learning tables based on (st,at, rt, st+1) and the utility function.
In some embodiments, the averaged learning table Q′ is computed as an element-wise average of the k updated learning tables Qi, i=1, 2, . . . , k.
In some embodiments, the method may further include: instantiating an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initializing an adversarial learning table QA for the adversarial agent; computing a plurality of updated adversarial learning tables based on the initialized adversarial learning table QA using a second utility function, the second utility function comprising a monotonically increasing convex function; and generating an averaged adversarial learning table QA′ based on the plurality of updated adversarial learning tables.
In some embodiments, the method may further include selecting an adversarial action by the adversarial agent based on the averaged adversarial learning table QA′ to minimize a reward for the automated agent.
In some embodiments, the second utility function is represented by uA(x)=−eβ
In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table {circumflex over (Q)}A based on the initialized adversarial learning table QA; selecting an adversarial action atA based on the interim adversarial learning table {circumflex over (Q)}A and a given state st from the plurality of states; computing an adversarial reward rtA and a next state st+1 based on the selected adversarial action atA; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table QiA of the plurality of updated adversarial learning tables based on (st,atA,rtA,st+1) and the second utility function.
In some embodiments, the averaged adversarial learning table QA′ is computed as an element-wise average of the k updated adversarial learning tables QiA, i=1, 2, . . . , k.
More specifically, a Risk-Averse Averaged Q-Learning (e.g., RA2-Q shown in
Table 1 below briefly summarizes each of the four machine learning models and their respective convergence guarantees (or lack thereof).
From t=1 to T, for each value of t (“while in the t loop”): the training system may set Q=QH and compute an interim Q-table {circumflex over (Q)} by
where λP>0 is a constant; and
Next, while in the t loop, the training system may select action at according to {circumflex over (Q)} by applying ϵ-greedy strategy, execute the action and get (st,at, rt, st+1), which can be appended to the replay buffer RB=RB ∪ {(st,at, rt, st+1)}.
The training system may, while in the t loop, generate a mask M ∈ ℕk with each entry sampled from a Poisson(1) distribution, and for i=1, . . . , k, for each value of i (“while in the i loop”):
if and when Mi=1, update the learning table Qi by
where u is a utility function configured to minimize risks, and x0=−1.
In some embodiments, the utility function u may be a monotonically increasing concave function in order to minimize risks (and maximize reward) for the automated agent. For example, an example utility function u(x) can be:
If x<=0, u(x)=0.5x; or
If x>0, u(x)=0.1x.
For another example, the utility function u may be u(x)=−eβx where β<0.
Next, while in the i loop, the training system may update Ni by Ni (st,at)=Ni (st,at)+1; update learning rate
Outside of the i loop but still while in the t loop, the training system may update H by randomly sampling integers from 1 to k.
Once outside of the t loop, the training system may generate the averaged Q-learning table
With Algorithm 5, even though convergence to the optimum of the risk-sensitive objective function holds in theory with probability 1, the proof assumes visiting every state infinitely many times, whereas the actual training time is finite. The RA2-Q algorithm above can reduce the training variance further by choosing more risk-averse actions during the finite training process.
The RA2-Q algorithm trains multiple Q tables in parallel and reduces training variance by averaging multiple Q tables in the update. Moreover, in order to obtain a convergence guarantee, k Q tables are trained and updated in parallel using Eq. (9) above as the update rule. To select more stable actions, the sample variance of the k Q tables can be used as an approximation to the true variance, and then a risk-averse {circumflex over (Q)} table (e.g., an interim Q-table) can be computed. The risk-averse {circumflex over (Q)} table can then be used to select actions.
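A compact sketch of one training step in the spirit of RA2-Q, as described above, is given below. The risk-averse interim table is built here as the sample mean minus λP times the sample standard deviation of the k tables, which is an assumed instantiation of the {circumflex over (Q)} construction; the constant learning rate, the env_step helper, and the treatment of the bootstrap head are likewise illustrative assumptions.

```python
import numpy as np

def ra2q_step(Q_tables, state, env_step, u, lambda_p=1.0, gamma=0.99,
              epsilon=0.1, alpha=0.1, x0=-1.0, rng=None):
    """One illustrative RA2-Q training step over k parallel Q-tables.

    Q_tables: array of shape (k, |S|, |A|).
    env_step: callable (state, action) -> (reward, next_state).
    u:        monotonically increasing concave utility,
              e.g. lambda x: -np.exp(-0.5 * x).
    """
    rng = rng or np.random.default_rng()
    k = Q_tables.shape[0]

    # Risk-averse interim table: penalize actions whose k value estimates disagree.
    q_hat = Q_tables.mean(axis=0) - lambda_p * Q_tables.std(axis=0)

    # epsilon-greedy action selection on the interim table.
    if rng.random() < epsilon:
        action = int(rng.integers(Q_tables.shape[2]))
    else:
        action = int(np.argmax(q_hat[state]))

    reward, next_state = env_step(state, action)

    # Poisson(1) bootstrap mask decides which of the k tables see this transition.
    mask = rng.poisson(1.0, size=k)
    for i in range(k):
        if mask[i] == 1:  # update when M_i = 1, as stated in the text
            td = reward + gamma * np.max(Q_tables[i, next_state]) \
                 - Q_tables[i, state, action]
            Q_tables[i, state, action] += alpha * (u(td) - x0)
    return Q_tables, next_state
```

In a full implementation, the per-(s, a) visit counts Ni and the decaying learning rates described above would replace the constant alpha used here.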
The objective function here is Eq. (7), and it can be shown that RA2-Q algorithm (also known as Algorithm 1) also converges to the optimal.
Theorem 2 Running RA2-Q algorithm for an initial Q table, then for all i ∈ {1, . . . , k}, Qi→Q* w.p. 1, hence the returned table
where Q* is the unique solution to
for all (s, a), where s′ is sampled from [·|s, a], and the corresponding policy π* of Q* satisfies {tilde over (J)}π*≥{tilde over (J)}π ∀π.
Theorem 2 follows directly from Theorem 1 (e.g., see Discussion section below for details).
From m=1 to T for each value of m (“while in the m loop”): the training system selects an action according to
While in the m loop, from i=1, . . . , N for each value of i (“while in the i loop”): the system defines the i-th empirical Bellman operator as
where si is randomly sampled from [·|s,a]; u is the utility function; and x0=−1.
In some embodiments, the utility function u may be a monotonically increasing concave function in order to minimize risks (and maximize reward) for the automated agent. For example, an example utility function u(x) can be:
If x<=0, u(x)=0.5x; or
If x>0, u(x)=0.1x.
For another example, the utility function u may be u(x)=−eβx where β<0.
Once outside of the i loop, the system defines
where N is a collection of N i.i.d. samples (i.e., matrices with samples for each state-action pair (s,a) from RB). Define Q1=
From k=1, . . . , K for each value of k (“while in the k loop”): the system computes stepsize
and
Qk+1=(1−λk)·Qk+λk·[{circumflex over (B)}k(Qk)−{circumflex over (B)}k({overscore (Q)})+{tilde over (B)}N({overscore (Q)})]
where {circumflex over (B)}k is an empirical Bellman operator constructed using a sample not in the collection of N i.i.d. samples, so that the random operators {circumflex over (B)}k and {tilde over (B)}N are independent.
Once outside of the k loop:
[34] proposed Variance Reduced Q-learning, which trains multiple Q tables in parallel and uses the averaged Q table in the update rule. It is shown that it guarantees a convergence rate which is minimax optimal. The RA2.1-Q algorithm improves upon [34] by applying a utility function to the TD error during Q updates for the purpose of further reducing variance. To select more stable actions during training, the sample variance of the k Q tables is used as an approximation to the true variance and a risk-averse {circumflex over (Q)} table is computed. The risk-averse {circumflex over (Q)} table can be used to select actions.
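One possible reading of the recentered, utility-wrapped update is sketched below under stated assumptions: the reference table Q_bar, the Monte Carlo estimate B_bar_of_Q_bar of the Bellman operator applied to it, the single-sample empirical operator, and the constant step size are all illustrative, so this is a sketch of the variance-reduction idea rather than a verbatim restatement of Algorithm 2.

```python
import numpy as np

def empirical_bellman(Q, transitions, gamma=0.99):
    """Empirical Bellman operator using one sampled transition per (s, a).

    transitions[(s, a)] = (reward, next_state) for that state-action pair.
    """
    out = np.empty_like(Q)
    n_states, n_actions = Q.shape
    for s in range(n_states):
        for a in range(n_actions):
            r, s_next = transitions[(s, a)]
            out[s, a] = r + gamma * np.max(Q[s_next])
    return out

def ra21q_update(Q, Q_bar, B_bar_of_Q_bar, transitions, u,
                 step=0.1, x0=-1.0, gamma=0.99):
    """SVRG-style recentered update with a utility applied to the error term.

    Q_bar is a fixed reference table and B_bar_of_Q_bar an estimate of the
    Bellman operator applied to it, averaged over N independent samples
    (the variance-reduction recipe borrowed from [34]); wrapping the
    recentered error in the concave utility u is the risk-averse twist
    described in the text. The same sampled transitions are applied to Q
    and Q_bar so that their noise largely cancels in the difference.
    """
    B_k_Q = empirical_bellman(Q, transitions, gamma)
    B_k_Qbar = empirical_bellman(Q_bar, transitions, gamma)
    recentered_error = B_k_Q - B_k_Qbar + B_bar_of_Q_bar - Q
    return Q + step * (u(recentered_error) - x0)
```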
For all (s,aP,aA), the system can initialize QP(s,aP,aA)=0; QA(s,aP,aA)=0; N(s,aP,aA)=0.
From t=1 to T, for each value of t (“while in the t loop”): at state st, the system computes πP(st), πA(st), which is a mixed strategy Nash equilibrium solution of the bimatrix game (QP(st), QA(st)). The system (or the automated agent) selects an action atP based on πP(st) according to ϵ-greedy strategy and selects an adversarial action atA based on πA(st) according to ϵ-greedy strategy. The system observes and computes rtP, rtA and st+1.
While in the t loop, at state st+1, the system computes πP(st+1), πA(st+1), which are mixed strategy Nash equilibrium solutions of the bimatrix game (QP(st+1), QA(st+1)). The system updates N(st,atP,atA)=N(st,atP,atA)+1 and sets the learning rate
The system then updates QP, QA such that:
QP(st,atP,atA)=QP(st,atP,atA)+αt·[uP(rtP+γ·πP(st+1)QP(st+1)πA(st+1)−QP(st,atP,atA))−x0] (11)
where uP is a utility function and x0=−1.
QA(st,atP,atA)=QA(st,atP,atA)+αt·[uA(rtA+γ·πP(st+1)QA(st+1)πA(st+1)−QA(st,atP,atA))−x1] (12)
where uA is a utility function, x1=1.
Outside of the t loop, the system then returns (QP, QA).
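Treating the mixed-strategy Nash equilibrium vectors at the next state as given, the two updates Eq. (11) and Eq. (12) can be sketched as follows; the bimatrix-game solver itself is assumed to be external, and the helper name and hyper-parameters are illustrative. The utility functions uP and uA follow the embodiments described next.

```python
import numpy as np

def ramq_update(QP, QA, s, aP, aA, rP, rA, s_next, piP_next, piA_next,
                uP, uA, alpha, gamma=0.99, x0=-1.0, x1=1.0):
    """Protagonist and adversary Q updates following Eq. (11) and Eq. (12).

    QP, QA have shape (|S|, |A_P|, |A_A|); piP_next and piA_next are the
    mixed-strategy probability vectors at s_next, assumed to be computed
    by an external bimatrix-game solver. uP is increasing and concave,
    uA increasing and convex, as in the embodiments described below.
    """
    # Expected next-state values under the joint mixed strategies:
    # pi_P(s')^T Q(s') pi_A(s').
    vP_next = piP_next @ QP[s_next] @ piA_next
    vA_next = piP_next @ QA[s_next] @ piA_next

    tdP = rP + gamma * vP_next - QP[s, aP, aA]
    tdA = rA + gamma * vA_next - QA[s, aP, aA]

    QP[s, aP, aA] += alpha * (uP(tdP) - x0)  # Eq. (11)
    QA[s, aP, aA] += alpha * (uA(tdA) - x1)  # Eq. (12)
    return QP, QA
```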
In some embodiments, the utility function uP may be a monotonically increasing concave function in order to minimize risks (and maximize reward) for the automated agent. For example, an example utility function uP(x) can be:
If x<=0, uP(x)=0.5x; or
If x>0, uP(x)=0.1x.
For another example, the utility function uP may be uP(x)=−eβ
In some embodiments, the utility function uA may be a monotonically increasing convex function in order to maximize risks (and minimize reward). For example, the utility function may be uA(x)=eβ
In complex scenarios such as financial markets, learned RL policies can be brittle. To improve robustness, adversarial learning is incorporated into a multi-agent learning problem in the RAM-Q algorithm.
In the adversarial setting, it is assumed that there are two learning processes happening simultaneously, a main protagonist (P) and an adversary (A): the goal of the protagonist is to maximize the total return as well as to minimize the variance; the goal of the adversary is to minimize the total return of the protagonist as well as to maximize the variance. Here, one assumption is that each agent can observe its opponent's immediate reward.
Let rtP be the immediate reward received by the protagonist at step t, and let rtA be the immediate reward received by the adversary at step t. Then the objective functions may be chosen as follows:
The objective function for the protagonist is,
By a Taylor expansion, Eq. (13) yields:
Similarly, the objective function for the adversary is,
and by Taylor expansion, Eq. (14) yields,
Then the following guarantee holds:
Theorem 3 If the two-agent game ({tilde over (J)}P,{tilde over (J)}A) has a Nash equilibrium solution, then running the RAM-Q algorithm from initial Q tables QP, QA will converge to QP* and QA* w.p. 1, s.t. the Nash equilibrium solution (π*P, π*A) for the bimatrix game (QP*, QA*) is the Nash equilibrium solution to the game ({tilde over (J)}πP,{tilde over (J)}πA), and the equilibrium payoffs are {tilde over (J)}P(s,π*P,π*A), {tilde over (J)}A(s,π*P,π*A).
Although the RAM-Q algorithm gives a solid convergence guarantee, it suffers from drawbacks such as expensive computational cost and idealized assumptions; e.g., in trading markets, there may not exist a Nash equilibrium to ({tilde over (J)}P,{tilde over (J)}A), and during the training process, assumptions about the Nash equilibrium (e.g., Assumption B.3 in the Discussion section below) may break [8]. Hence, another algorithm, RA3-Q, is developed, which relaxes these assumptions (likely at the expense of losing theoretical guarantees) while enhancing robustness and performing well in practice.
The training system can initialize QPi, QAi ∀i=1, . . . , k, and initialize N=0. The system then randomly samples action-choosing head integers HP, HA ∈ {1, . . . , k}.
From t=1 to T, for each value of t (“while in the t loop”): the system sets QP=QPHP, QA=QAHA and computes interim risk-averse tables {circumflex over (Q)}P, {circumflex over (Q)}A.
Next, the system selects actions aP, aA according to {circumflex over (Q)}P, {circumflex over (Q)}A by applying ϵ-greedy strategy and generates a mask M ∈ ℕk with each entry sampled from a Poisson(1) distribution. The system updates QiP, QiA, i=1, . . . , k according to mask M using update rules Eq. (11) and Eq. (12). The system then updates HP and HA.
Once outside of the t loop, the system returns
In RA3-Q, the objective function for the protagonist agent is Eq. (13), and the objective function for the adversary agent is Eq. (14). In order to optimize {tilde over (J)}P and {tilde over (J)}A, utility functions are applied to TD errors when updating Q tables, and training multiple Q tables in parallel is used to select actions with low variance. The full version of RA3-Q is Algorithm 7 in
RA3-Q combines (i) risk-aversion using utility functions, (ii) variance reduction by maintaining multiple Q tables, and (iii) robustness via adversarial learning. Intuitively, as the adversary gets stronger, the protagonist faces harder challenges, thus enhancing robustness. Compared to RAM-Q, where the returned policy (πP,πA) is a Nash equilibrium of ({tilde over (J)}P,{tilde over (J)}A), RA3-Q does not have a convergence guarantee; however, it has several practical advantages, including computational efficiency, simplicity (e.g., no strong assumptions) and more stable actions during training. For a longer discussion see the Discussion section, below.
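For completeness, one RA3-Q training step is sketched below in compressed form. The risk-averse interim tables (sample mean minus a scaled sample standard deviation), the constant learning rate, and in particular the use of greedy one-hot strategies at the next state in place of a bimatrix Nash solution are all simplifying assumptions made so that the sketch stays self-contained; they are not prescribed by this disclosure.

```python
import numpy as np

def ra3q_step(QP_tables, QA_tables, state, env_step, uP, uA,
              lambda_p=1.0, epsilon=0.1, alpha=0.1, gamma=0.99,
              x0=-1.0, x1=1.0, rng=None):
    """One illustrative RA3-Q step with k protagonist and k adversary tables.

    QP_tables, QA_tables: arrays of shape (k, |S|, |A_P|, |A_A|).
    env_step: callable (state, aP, aA) -> (rP, rA, next_state).
    """
    rng = rng or np.random.default_rng()
    k, _, nP, nA = QP_tables.shape

    # Risk-averse interim tables built from the k bootstrapped estimates.
    qP_hat = QP_tables.mean(0) - lambda_p * QP_tables.std(0)
    qA_hat = QA_tables.mean(0) - lambda_p * QA_tables.std(0)

    # Independent epsilon-greedy selection on the interim tables.
    aP = int(rng.integers(nP)) if rng.random() < epsilon \
        else int(np.argmax(qP_hat[state].max(axis=1)))
    aA = int(rng.integers(nA)) if rng.random() < epsilon \
        else int(np.argmax(qA_hat[state].max(axis=0)))

    rP, rA, s_next = env_step(state, aP, aA)

    # Greedy one-hot strategies at the next state stand in for the Nash
    # strategies appearing in Eq. (11) and Eq. (12).
    piP = np.eye(nP)[int(np.argmax(qP_hat[s_next].max(axis=1)))]
    piA = np.eye(nA)[int(np.argmax(qA_hat[s_next].max(axis=0)))]

    mask = rng.poisson(1.0, size=k)
    for i in range(k):
        if mask[i] == 1:
            vP = piP @ QP_tables[i, s_next] @ piA
            vA = piP @ QA_tables[i, s_next] @ piA
            tdP = rP + gamma * vP - QP_tables[i, state, aP, aA]
            tdA = rA + gamma * vA - QA_tables[i, state, aP, aA]
            QP_tables[i, state, aP, aA] += alpha * (uP(tdP) - x0)
            QA_tables[i, state, aP, aA] += alpha * (uA(tdA) - x1)
    return QP_tables, QA_tables, s_next
```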
Running RA3-Q from an initial Q table,
where QP* is a solution to
∀(st,aP,aA), where st+1 is sampled from [·|st,aP,aA], and the corresponding policy π*P of QP* satisfies {tilde over (J)}π*
where QA* is a solution to
∀(st,aP,aA). Where st+1 is sampled from [·|st,aP,aA]. And the corresponding policy π*A of QA* satisfies {tilde over (J)}π*
When the environment is populated by many learning agents, empirical game theory (EGT) may be used to evaluate the performance of the agents.
In EGT, each agent is a player involved in rounds of strategic interaction (games). By meta-game analysis, the superiority of each strategy can be evaluated. A contribution of this disclosure is to theoretically prove that the Nash equilibrium of the risk-averse meta-game is an approximation of the Nash equilibrium of the population game; to the inventors' knowledge, this is the first work to perform this type of risk-averse analysis.
In EGT, the dominance of strategies can be visualized by plotting the meta-game payoff tables together with the replicator dynamics. A meta-game payoff table can be seen as a combination of two matrices (N|R), where each row Ni contains a discrete distribution of p players over k strategies, yielding a discrete profile (nπ1, . . . , nπk) indicating how many players play each strategy.
And each row Ri captures the rewards corresponding to the rows in N.
For example, for a game A with 2 players and 3 strategies {π1, π2, π3} to choose from, the meta-game payoff table can be constructed as follows: on the left side of the table, all of the possible combinations of strategies are listed. If there are p players and k strategies, then there are C(p+k−1, p) rows; hence in game A, there are C(2+3−1, 2)=6 rows.
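As a small worked check of that count, the number of distinct strategy profiles of p indistinguishable players over k strategies is the multiset coefficient C(p+k−1, p); the short snippet below, using the 2-player, 3-strategy game A as the example, is purely illustrative.

```python
from itertools import combinations_with_replacement
from math import comb

p, k = 2, 3  # players and strategies in the example game A

# Number of rows in the meta-game payoff table: C(p + k - 1, p).
print(comb(p + k - 1, p))  # -> 6

# The six discrete profiles (how many players use each of pi_1, pi_2, pi_3).
strategies = ["pi_1", "pi_2", "pi_3"]
for profile in combinations_with_replacement(strategies, p):
    counts = tuple(profile.count(s) for s in strategies)
    print(profile, "->", counts)
```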
Once a meta-game payoff table and the replicator dynamics are obtained, a directional field plot is computed where arrows in the strategy space indicate the direction of flow, or change, of the population composition over the strategies (see the Discussion section below for two examples of directional field plots in multi-agent problems).
Previously, [33] showed that for a game ri(π1, . . . , πp)=𝔼[Ri(π1, . . . , πp)], with a meta-payoff (empirical payoff) {circumflex over (r)}i(π1, . . . , πp), the Nash equilibrium of {circumflex over (r)} is an approximation of the Nash equilibrium of r.
Lemma 1 [33] If x is a Nash Equilibrium for the game {circumflex over (r)}i (π1, . . . , πp), then it is a 2ϵ-Nash equilibrium for the game ri(π1, . . . , πp), where
Lemma 1 implies that if for each player, we can bound the estimation error of empirical payoff, then we can use the Nash Equilibrium of meta game as an approximation of Nash Equilibrium of the game.
As the objective is to consider risk averse payoff to evaluate strategies, instead of
ri(π1, . . . ,πp)=𝔼[Ri(π1, . . . ,πp)],
The following equation
hi(π1, . . . ,πp)=𝔼[Ri(π1, . . . ,πp)]−β·Var[Ri(π1, . . . ,πp)]
(where β>0) is chosen as the game payoff.
Moreover, the following equation
is chosen as meta-game payoff, where
and Rji is the stochastic payoff of player i in the j-th experiment.
To the inventors' knowledge, there is no previous work on empirical game theory analysis with risk-sensitive payoffs. Below, a theoretical analysis is presented to show that for the risk-averse payoff game, the Nash equilibrium can still be approximated by a meta-game.
Theorem 4 Under Assumption G.4, for a Normal Form Game with p players, where each player i chooses a strategy πi from a set of strategies Si={π1i, . . . , πki} and receives a meta payoff hi(π1, . . . ,πp) (Eq. (15)): if x is a Nash equilibrium for the game ĥi(π1, . . . ,πp), then it is a 2ϵ-Nash equilibrium for the game hi(π1, . . . ,πp) with probability 1−δ if the game is played n times, where
The experiments are conducted using the open-sourced ABIDES [11] market simulator in a simplified setting. The environment is generated by replaying publicly available real trading data for a single stock ticker (see e.g., https://lobsterdata.com/info/DataSamples.php). The setting includes one non-learning agent that replays the market deterministically [3] and learning agents. The learning agents considered are: RAQL (i.e., Algorithm 5), RA2-Q (i.e., Algorithm 1), RA2.1-Q (i.e., Algorithm 2), and RA3-Q (i.e., Algorithm 4).
A setting similar to existing implementations in ABIDES (see e.g., https://github.com/abides-sim/abides/blob/master/agent/examples/QLearningAgent.py) is used where the state space is defined by two features: current holdings and volume imbalance. Agents take one action at every time step (every second) selecting among: buy/sell with limit price base+i·K, where i ∈ {1, 2, . . . , 6} or do nothing. The immediate reward is defined by the change in the value of the portfolio (mark-to-market) and comparing against the previous time step. The comparisons are in terms of Sharpe ratio, which is a widely used measure in trading markets.
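To make the setting concrete, a minimal sketch of the state, action, and reward definitions described above is given below; the discretization bins, the increment K, and the helper names are assumptions for illustration and do not reproduce the ABIDES agent implementation.

```python
import numpy as np

K = 0.01  # illustrative price increment; the actual offset size is not specified here

def make_state(holdings, volume_imbalance, holdings_bins, imbalance_bins):
    """Discretize the two state features: current holdings and volume imbalance."""
    return (int(np.digitize(holdings, holdings_bins)),
            int(np.digitize(volume_imbalance, imbalance_bins)))

# 13 actions: do nothing, or buy/sell with limit price base + i*K for i in 1..6.
ACTIONS = [("hold", 0)] + [(side, i) for side in ("buy", "sell") for i in range(1, 7)]

def limit_price(base_price, i):
    """Limit price of the i-th action level, base + i*K, as described in the text."""
    return base_price + i * K

def reward(portfolio_value_now, portfolio_value_prev):
    """Mark-to-market reward: change in portfolio value versus the previous step."""
    return portfolio_value_now - portfolio_value_prev
```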
Table 2 below shows the meta-payoff table of a two player-game among three strategies: RAQL, RA2-Q and RA2.1-Q. The results show that the two proposed algorithms RA2-Q and RA2.1-Q obtained better results than RAQL. With those payoffs, the directional plot 700 and the trajectory plot 710 shown in
RA2-Q and RA3-Q are also compared in terms of robustness. In this setting, both agents are trained under the same conditions as a first step. Then, in the testing phase, two types of perturbations, an adversarial agent (trained within RA3-Q) and a noise agent (i.e., zero-intelligence), are added to the environment. The results are presented in Table 3 below, in terms of Sharpe ratio using cross validation with 80 experiments.
Table 3 shows the comparison, again in terms of Sharpe ratio, with the two types of perturbations: the trained adversary from RA3-Q used at testing time, and zero-intelligence agents. It can be seen that RA3-Q obtains better results in both cases due to its enhanced robustness.
As mentioned earlier, RA2.1-Q in theory does not have a convergence guarantee; however, it obtained good empirical results (better than RAQL and RA2-Q). It is an open question whether RA2.1-Q converges to the optimum of Eq. (7). Furthermore, it may be explored whether RA2.1-Q also enjoys a minimax-optimal convergence rate up to a logarithmic factor as in [34]. Similarly, RA3-Q does not have a convergence guarantee in the multi-agent learning scenario (when the protagonist and adversary agents are learning simultaneously). However, RA3-Q obtained better empirical results than RA2-Q, highlighting its robustness. In the text below, it is shown that Eq. (81) or Eq. (82) converges to the optimum assuming the policy for the adversary (or protagonist) is fixed (thus, it is no longer a multi-agent learning setting).
In terms of the EGT analysis, the analysis uses a risk-averse measure based on variance (second moment), studying higher moments and other measures may be possible.
In this disclosure, four new Q-learning algorithms are presented that augment reinforcement learning agents with risk-awareness, variance reduction, and robustness. RA2-Q and RA2.1-Q are risk-averse but use slightly different techniques to reduce variance. RAM-Q and RA3-Q are two algorithms that extend the RL agents by adding an adversarial learning layer, which is expected to improve robustness. The theoretical analysis establishes convergence results for RA2-Q and RAM-Q; in the empirical results, RA2.1-Q and RA3-Q obtained better results in a simplified trading scenario.
where u is a utility function, u(x)=−eβx where β<0; x0=−1.
As proven in [28] (Lemma A.2), for the iterative procedure
where αt≥0 satisfy, for any (s, a), Σt=0∞αt(s, a)=∞ and Σt=0∞αt2(s, a)<∞, then Qt→Q*, where Q* is the solution of the Bellman equation
If Lemma A.2 is true, then it is shown in [28] that the corresponding policy optimizes the objective function Eq. (7).
Before proving convergence in Lemma A.2, a more general update rule is discussed:
q_{t+1}(i) = (1 − α_t(i))·q_t(i) + α_t(i)·[(H q_t)(i) + w_t(i)]  (20)
where i is the independent variable (e.g., in single-agent Q-learning, it is the state-action pair (s, a)), q_t ∈ ℝ^d, H: ℝ^d → ℝ^d is an operator, w_t denotes a random noise term, and α_t is the learning rate, with the understanding that α_t(i) = 0 if q(i) is not updated at time t. Denote by ℱ_t the history of the algorithm up to time t,
ℱ_t = {q_0(i), . . . , q_t(i), w_0(i), . . . , w_t(i), α_0(i), . . . , α_t(i)}  (21)
Recall the following essential proposition:
Proposition 1 [6]. Let q_t be the sequence generated by the iteration Eq. (20). If the following assumptions hold:
(a). The learning rates αt(i) satisfy:
α_t(i) ≥ 0; Σ_{t=0}^∞ α_t(i) = ∞; Σ_{t=0}^∞ α_t²(i) < ∞; ∀i  (22)
(b). The noise terms w_t(i) satisfy (i) 𝔼[w_t(i) | ℱ_t] = 0, and (ii) 𝔼[w_t²(i) | ℱ_t] ≤ A + B·∥q_t∥∞² for some constants A and B.
(c). The mapping H is a contraction under sup-norm.
Then qt converges to the unique solution q* of the equation Hq*=q* with probability 1.
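As an illustrative check (not taken from the source), the standard choice α_t(i) = 1/(t+1) at every step at which coordinate i is updated satisfies condition (a):

```latex
\alpha_t(i) = \frac{1}{t+1} \ge 0,
\qquad \sum_{t=0}^{\infty} \frac{1}{t+1} = \infty,
\qquad \sum_{t=0}^{\infty} \frac{1}{(t+1)^{2}} = \frac{\pi^{2}}{6} < \infty .
```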
In order to apply Proposition 1, the update rule Eq. (9) is formulated by letting
And the following is set:
where s′ is sampled from the transition kernel P(·|s, a).
More explicitly, Hq is defined as
Next, it is shown that H is a contraction under sup-norm.
The utility function is assumed to satisfy:
Assumption A.1
i. The utility function u is strictly increasing and there exists some y_0 ∈ ℝ such that u(y_0) = x_0.
ii. There exist positive constants ϵ, L such that
ϵ ≤ (u(x) − u(y)) / (x − y) ≤ L
for all x ≠ y ∈ ℝ.
Assumption A.1 appears to exclude several important types of utility functions, such as the exponential function u(x) = exp(c·x), since it does not satisfy the global Lipschitz condition. However, this can be addressed by truncating the utility when |x| is very large and by an approximation when x is very close to 0.
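A minimal sketch of such a truncation is given below; the clipping threshold x_max and the value of β are illustrative assumptions. Outside the band [−x_max, x_max] the utility is extended linearly, so its slope stays inside a fixed interval [ϵ, L], as required by Assumption A.1(ii).

```python
import numpy as np

def truncated_exp_utility(x, beta=-0.5, x_max=5.0):
    """Risk-averse exponential utility u(x) = -exp(beta*x), beta < 0, extended
    linearly outside [-x_max, x_max] so its derivative stays bounded (sketch only)."""
    x = np.asarray(x, dtype=float)
    lo, hi = -x_max, x_max
    u = -np.exp(beta * np.clip(x, lo, hi))       # exact utility inside the band
    slope_lo = -beta * np.exp(beta * lo)         # u'(lo): the largest slope (plays the role of L)
    slope_hi = -beta * np.exp(beta * hi)         # u'(hi): the smallest slope (plays the role of eps)
    u = np.where(x < lo, -np.exp(beta * lo) + slope_lo * (x - lo), u)
    u = np.where(x > hi, -np.exp(beta * hi) + slope_hi * (x - hi), u)
    return u
```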
In addition, the immediate reward rt is assumed to always satisfy:
Assumption A.2 rt is uniformly sub-Gaussian over t with variance proxy σ2, i.e.,
Proposition 2 Suppose that Assumption A.1 and Assumption A.2 hold and 0<α<min(L−1, 1). Then there exists a real number
By Assumption A.1, and the monotonicity of ũ, there exists a ξ(x,y) ∈ [ϵ, L] such that ũ(x)−ũ(y)=ξ(x,y)·(x−y). Then the following can be obtained:
Hence,
Now that it has been shown that requirements (a) and (c) of Proposition 1 hold, it remains to check (b). By Eq. (24), 𝔼[w_t(s, a) | ℱ_t] = 0, which establishes (b)(i). Next, the proof of (b)(ii) is presented.
By Assumption A.2,
where Γ(⋅) is the Gamma function (see [10] for details). The upper bound for 𝔼[|r_t|] is denoted R_1. Then 𝔼[|d_t|] ≤ R_1 + 2∥q_t∥∞, which, due to Assumption A.1, implies that
𝔼[|ũ(d_t) − ũ(0)|] ≤ 𝔼[L·|d_t|] ≤ L(R_1 + 2∥q_t∥∞)  (36)
Hence, by the triangle inequality,
𝔼[|ũ(d_t)|] ≤ |ũ(0)| + L·R_1 + 2L·∥q_t∥∞  (37)
And since
(a+b)² ≤ 2a² + 2b²  ∀ a, b ∈ ℝ  (38)
it can be shown
(|ũ(0)| + L·R_1 + 2L·∥q_t∥∞)² ≤ 2(|ũ(0)| + L·R_1)² + 8L²·∥q_t∥∞²  (39)
And since
where R_2 is the upper bound for 𝔼[r_t²] due to Assumption A.2 (𝔼[r_t²] ≤ 4σ²·Γ(1) [10]).
Note that here ũ(0)=0, therefore:
α²·𝔼[(ũ(d_t))² | ℱ_t] ≤ α²·(L·R_2 + 2L·R_1(1−γ)·∥q_t∥∞ + L(1−γ)²·∥q_t∥∞²)  (44)
hence,
𝔼[w_t²(s, a) | ℱ_t] ≤ 2α²·(L·R_2 + 2L·R_1(1−γ)·∥q_t∥∞ + L(1−γ)²·∥q_t∥∞²)  (45)
If ∥q_t∥∞ ≤ 1, then
𝔼[w_t²(s, a) | ℱ_t] ≤ 2α²·(L·R_2 + 2L·R_1(1−γ) + L(1−γ)·∥q_t∥∞²)  (46)
If ∥q_t∥∞ > 1, then
𝔼[w_t²(s, a) | ℱ_t] ≤ 2α²·(L·R_2 + (2L·R_1(1−γ) + L(1−γ)²)·∥q_t∥∞²)  (47)
It has been shown that q_t satisfies all of the requirements of Proposition 1, so q_t → q* with probability 1.
This sub-section describes the Nash-Q learning algorithm [16] and its convergence guarantees. Assumption B.3 below will also be used in the proof of RAM-Q.
From t = 1 to T, for each value of t: at state s_t, the training system computes π_A1(s_t), which is a mixed-strategy Nash equilibrium solution of the bimatrix game (Q_A1(s_t), Q_A2(s_t)). The system can select an action a_t^A based on π_A1(s_t) according to an ϵ-greedy strategy, then observe and compute r_t^A, r_t^B, a_t^B and s_{t+1}. At state s_{t+1}, the training system computes π_A1(s_{t+1}), π_A2(s_{t+1}), which are mixed-strategy Nash equilibrium solutions of the bimatrix game (Q_A1(s_{t+1}), Q_A2(s_{t+1})). The training system then updates N_A(s_t, a_t^A, a_t^B) = N_A(s_t, a_t^A, a_t^B) + 1 and sets the learning rate
The system can update QA1, QA2 such that
Q_A1(s_t, a_t^A, a_t^B) = (1 − α_t^A)·Q_A1(s_t, a_t^A, a_t^B) + α_t^A·[r_t^A + γ·π_A1(s_{t+1}) Q_A1(s_{t+1}) π_A2(s_{t+1})]
Q_A2(s_t, a_t^A, a_t^B) = (1 − α_t^A)·Q_A2(s_t, a_t^A, a_t^B) + α_t^A·[r_t^B + γ·π_A1(s_{t+1}) Q_A2(s_{t+1}) π_A2(s_{t+1})]
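A minimal sketch of this tabular Nash-Q update is shown below. The dictionary-of-matrices layout and the solve_bimatrix_nash helper (which stands in for any bimatrix Nash equilibrium solver, e.g. support enumeration) are assumptions for illustration; only the update rule itself mirrors the equations above.

```python
import numpy as np

def nash_q_update(Q1, Q2, s, aA, aB, rA, rB, s_next, alpha, gamma, solve_bimatrix_nash):
    """One tabular Nash-Q update step.

    Q1, Q2 : dict mapping state -> payoff matrix of shape (|A_A|, |A_B|).
    solve_bimatrix_nash : assumed helper returning mixed strategies (pi1, pi2)
        forming a Nash equilibrium of the bimatrix game (Q1[s'], Q2[s']).
    """
    pi1, pi2 = solve_bimatrix_nash(Q1[s_next], Q2[s_next])
    nash_value_1 = pi1 @ Q1[s_next] @ pi2   # pi_A1(s') Q_A1(s') pi_A2(s')
    nash_value_2 = pi1 @ Q2[s_next] @ pi2   # pi_A1(s') Q_A2(s') pi_A2(s')
    Q1[s][aA, aB] = (1 - alpha) * Q1[s][aA, aB] + alpha * (rA + gamma * nash_value_1)
    Q2[s][aA, aB] = (1 - alpha) * Q2[s][aA, aB] + alpha * (rB + gamma * nash_value_2)
    return Q1, Q2
```

The Nash value π_1 Q(s′) π_2 is the expected payoff of the equilibrium mixed strategies at the next state, which plays the role of the single-agent max operator in ordinary Q-learning.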
Assumption B.3 [16] A Nash equilibrium (π1(s), π2(s)) for any bimatrix game (Q1(s), Q2(s)) during the training process satisfies one of the following properties:
1. The Nash equilibrium is global optimal.
π_1(s) Q_k(s) π_2(s) ≥ π̂_1(s) Q_k(s) π̂_2(s)  ∀ π̂_1(s), π̂_2(s), and k = 1, 2  (48)
2. If the Nash equilibrium is not a global optimum, then an agent receives a higher payoff when the other agent deviates from the Nash equilibrium strategy:
π_1(s) Q_1(s) π_2(s) ≤ π_1(s) Q_1(s) π̂_2(s)  ∀ π̂_2(s)  (49)
π_1(s) Q_k(s) π_2(s) ≥ π̂_1(s) Q_k(s) π̂_2(s)  ∀ π̂_1(s)  (50)
Theorem 5 (Theorem 4 of [16]). Under Assumption B.3, the coupled sequences Q_A1, Q_A2 updated by Algorithm 6 converge to the Nash equilibrium Q values (Q*_1, Q*_2), with Q*_k (k = 1, 2) defined as
Q*_1(s, a^A, a^B) = r^A(s, a^A, a^B) + γ·𝔼[J_A(s′, π*_A, π*_B)]  (51)
Q*_2(s, a^A, a^B) = r^B(s, a^A, a^B) + γ·𝔼[J_B(s′, π*_A, π*_B)]  (52)
where (π*_A, π*_B) is a Nash equilibrium solution for this stochastic game (J_A, J_B) and
J_A(s′, π*_A, π*_B) = Σ_{t=0}^∞ γ^t·𝔼[r_t^A | π*_A, π*_B, s_0 = s′]  (53)
J_B(s′, π*_A, π*_B) = Σ_{t=0}^∞ γ^t·𝔼[r_t^B | π*_A, π*_B, s_0 = s′]  (54)
Poisson masks M ~ Poisson(1) provide parallel learning since
as T→∞, so each Q table Qi is trained in parallel. The proof of convergence of Qi for all i ∈ {1, . . . , k} is shown in the Proof of Theorem 1 above. Hence
In this section, the convergence of Algorithm 3 (RAM-Q) is proven under Assumption B.3. The convergence proof is based on the following lemma.
Lemma D.3 (Conditional Averaging Lemma [30]). Assume the learning rate α_t satisfies Proposition 1(a). Then the process Q_{t+1}(i) = (1 − α_t(i))·Q_t(i) + α_t·w_t(i) converges to 𝔼[w_t(i) | h_t, α_t], where h_t is the history at time t.
The proof of convergence of Q_P is shown as an example; the proof of convergence of Q_A is the same. First, the update rule Eq. (11) is reformulated as:
set
(H_P Q_P)(s_t, a_t^P, a_t^A) = α·u_P(r_t^P + γ·π_P(s_{t+1}) Q_P(s_{t+1}) π_A(s_{t+1}) − Q_P(s_t, a_t^P, a_t^A)) − α·x_0 + Q_P(s_t, a_t^P, a_t^A)  (57)
and HAQA is defined symmetrically as
(H_A Q_A)(s_t, a_t^P, a_t^A) = α·u_A(r_t^A + γ·π_P(s_{t+1}) Q_A(s_{t+1}) π_A(s_{t+1}) − Q_A(s_t, a_t^P, a_t^A)) − α·x_1 + Q_A(s_t, a_t^P, a_t^A)  (58)
It is shown in [16] that the operator (M_t^P, M_t^A) is a γ-contraction mapping, where (M_t^P, M_t^A) is defined as
M_t^P Q_P(s) = r_t^P + γ·π_P(s) Q_P(s) π_A(s)  (59)
M_t^A Q_A(s) = r_t^A + γ·π_P(s) Q_A(s) π_A(s)  (60)
Next, it is shown that (HP, HA) is a contraction under sup-norm (under Assumption A.1).
Similarly, ∥H_A Q_A − H_A Q̂_A∥∞ ≤ (1 − αϵ(1−γ))·∥Q_A − Q̂_A∥∞.
Hence (H_P, H_A) is a (1 − αϵ(1−γ))-contraction under sup-norm. By Lemma D.3, the update rules Eqs. (11) and (12) respectively converge to
Q_P(s_t, a_t^P, a_t^A) → 𝔼[α·u_P(r_t^P + γ·π_P(s_{t+1}) Q_P(s_{t+1}) π_A(s_{t+1}) − Q_P(s_t, a_t^P, a_t^A)) − α·x_0 + Q_P(s_t, a_t^P, a_t^A)]  (64)
Q_A(s_t, a_t^P, a_t^A) → 𝔼[α·u_A(r_t^A + γ·π_P(s_{t+1}) Q_A(s_{t+1}) π_A(s_{t+1}) − Q_A(s_t, a_t^P, a_t^A)) − α·x_1 + Q_A(s_t, a_t^P, a_t^A)]  (65)
i.e., Eqs. (11) and (12) respectively converge to Q*_P, Q*_A, where Q*_P, Q*_A are the solutions to the Bellman equations
𝔼_{s,a^P,a^A}[u_P(r^P(s, a^P, a^A) + γ·π_P*(s′) Q_P*(s′) π_A*(s′) − Q_P*(s, a^P, a^A))] = x_0  (66)
𝔼_{s,a^P,a^A}[u_A(r^A(s, a^P, a^A) + γ·π_P*(s′) Q_A*(s′) π_A*(s′) − Q_A*(s, a^P, a^A))] = x_1  (68)
where (πP*,πA*) is the Nash equilibrium solution to the bimatrix game (Q*P, Q*A).
Next, it is shown that (π_P*, π_A*) is a Nash equilibrium solution for the game with equilibrium payoffs (J̃_P(s, π_P*, π_A*), J̃_A(s, π_P*, π_A*)). As in [28], for any random variable X, define a mapping 𝔼_P(X | s, a^P, a^A) (for brevity written as 𝔼_{s,a^P,a^A}(X)) by
𝔼_{s,a^P,a^A}(X) = sup{m ∈ ℝ | 𝔼_{s,a^P,a^A}[u(X − m)] ≥ x_0}
Similar to [28, 32], suppose (π_P, π_A) is a Nash equilibrium solution to the game (J̃_P(s, π_P, π_A), J̃_A(s, π_P, π_A)); then the payoffs J̃_P(s, π_P, π_A), J̃_A(s, π_P, π_A) are the solutions to the risk-sensitive Bellman equations
J̃_P(s, π_P, π_A) = π_P(s)·𝔼_{s,a^P,a^A}(r^P(s, a^P, a^A) + γ·J̃_P(s′, π_P, π_A))·π_A(s)  (69)
J̃_A(s, π_P, π_A) = π_P(s)·𝔼_{s,a^P,a^A}(r^A(s, a^P, a^A) + γ·J̃_A(s′, π_P, π_A))·π_A(s)  (70)
And the corresponding Q tables satisfy
Q_P(s, a^P, a^A) = 𝔼_{s,a^P,a^A}(r^P(s, a^P, a^A) + γ·J̃_P(s′, π_P, π_A))  (71)
Q_A(s, a^P, a^A) = 𝔼_{s,a^P,a^A}(r^A(s, a^P, a^A) + γ·J̃_A(s′, π_P, π_A))  (72)
Note that 𝔼_{s,a^P,a^A}(·) here denotes the mapping defined above.
[28] showed that Eq. (71) is equivalent to
𝔼_{s,a^P,a^A}[u_P(r^P(s, a^P, a^A) + γ·J̃_P(s′, π_P, π_A) − Q_P(s, a^P, a^A))] = x_0  (73)
𝔼_{s,a^P,a^A}[u_A(r^A(s, a^P, a^A) + γ·J̃_A(s′, π_P, π_A) − Q_A(s, a^P, a^A))] = x_1  (74)
which, substituting Eqs. (69)-(70), can be rewritten as
𝔼_{s,a^P,a^A}[u_P(r^P(s, a^P, a^A) + γ·π_P Q_P(s′) π_A − Q_P(s, a^P, a^A))] = x_0  (75)
𝔼_{s,a^P,a^A}[u_A(r^A(s, a^P, a^A) + γ·π_P Q_A(s′) π_A − Q_A(s, a^P, a^A))] = x_1  (76)
which is exactly Eq. (66).
It has been shown that, under Assumption B.3, Eq. (69) and Eq. (66) are equivalent. Hence Algorithm 3 (RAM-Q) converges to (Q_P*, Q_A*) such that the Nash equilibrium solution (π_P*, π_A*) for the bimatrix game (Q_P*, Q_A*) is the Nash equilibrium solution to the game, and the equilibrium payoffs are J̃_P(s, π_P*, π_A*), J̃_A(s, π_P*, π_A*).
Previously, a short version of RA3-Q was presented in Algorithm 4 (e.g., as illustrated in the accompanying figures); a more detailed description follows.
The training system first initializes Q_P^i(s, a^P, a^A) = 0 and Q_A^i(s, a^P, a^A) = 0 for all i = 1, . . . , k and all (s, a^P, a^A), and sets the visit counter N = 0. The training system then randomly samples head integers H^P, H^A ∈ {1, . . . , k}.
From t = 1 to T, for each value of t (“while in the t loop”): the training system sets Q_P = Q_P^{H^P}.
The training system sets Q_A = Q_A^{H^A}.
The optimal actions (a′P, a′A) are defined as
While in the t loop, the training system selects actions a^P, a^A according to Q̂_P, Q̂_A by applying an ϵ-greedy strategy. The two agents respectively execute actions a^P, a^A and observe (s_t, a^P, a^A, r_t^A, r_t^P, s_{t+1}).
While in the t loop, the training system generates a mask M ∈ ℕ^k with entries M_i ~ Poisson(1) and updates
and N(s_t, a^P, a^A) = N(s_t, a^P, a^A) + 1.
While in the t loop, from i = 1, . . . , k, for each value of i (“while in the i loop”): if and when M_i = 1, the training system updates Q_P^i by
where u_P is a monotonically increasing concave utility function, e.g., u_P(x) = −e^{β_P·x} with β_P < 0.
While in the t loop, from i = 1, . . . , k, for each value of i (“while in the i loop”): if and when M_i = 1, the training system updates Q_A^i by:
where u_A is a monotonically increasing convex utility function, e.g., u_A(x) = e^{β_A·x} with β_A > 0.
Once outside of the i loop, the training system updates HP and HA by randomly sampling integers from 1 to k.
Once outside of the t loop, the training system returns
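To make the above loop concrete, the following is a highly simplified sketch of its structure. The environment interface (env.reset(), env.step(aP, aA) returning (s′, rP, rA)), the greedy bootstrap targets (standing in for the optimal actions (a′P, a′A) whose exact definition is given in the disclosure), and the constants x0 = u_P(0) = −1 and x1 = u_A(0) = 1 are assumptions for illustration; only the overall flow (head sampling, ϵ-greedy action selection, Poisson masks, and risk-sensitive utility updates) mirrors the steps above.

```python
import numpy as np

rng = np.random.default_rng(0)

def u_P(x, beta_P=-0.5):
    return -np.exp(beta_P * x)      # concave, increasing (risk-averse), u_P(0) = -1

def u_A(x, beta_A=0.5):
    return np.exp(beta_A * x)       # convex, increasing (risk-seeking), u_A(0) = 1

def ra3_q_sketch(env, n_states, nP, nA, T=10_000, k=5,
                 alpha=0.1, gamma=0.99, eps=0.1, x0=-1.0, x1=1.0):
    QP = np.zeros((k, n_states, nP, nA))       # k protagonist Q tables
    QA = np.zeros((k, n_states, nP, nA))       # k adversary Q tables
    HP, HA = rng.integers(k), rng.integers(k)  # sampled head indices
    s = env.reset()
    for t in range(T):
        # epsilon-greedy actions from the currently selected head tables
        if rng.random() < eps:
            aP, aA = rng.integers(nP), rng.integers(nA)
        else:
            aP = int(QP[HP, s].max(axis=1).argmax())   # best protagonist row
            aA = int(QA[HA, s].max(axis=0).argmax())   # best adversary column
        s_next, rP, rA = env.step(aP, aA)
        mask = rng.poisson(1.0, size=k)                # Poisson(1) masks
        for i in range(k):
            if mask[i] >= 1:
                dP = rP + gamma * QP[i, s_next].max() - QP[i, s, aP, aA]
                dA = rA + gamma * QA[i, s_next].max() - QA[i, s, aP, aA]
                QP[i, s, aP, aA] += alpha * (u_P(dP) - x0)   # risk-averse update
                QA[i, s, aP, aA] += alpha * (u_A(dA) - x1)   # risk-seeking update
        HP, HA = rng.integers(k), rng.integers(k)            # resample heads
        s = s_next
    return QP.mean(axis=0), QA.mean(axis=0)                  # averaged tables
```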
In this section, convergence issues of RA3-Q are discussed. First, a simplified setting is shown where, if the adversary's policy is a fixed policy π_0^A, the update rule for the protagonist agent, Eq. (81), converges to the optimum of J_P(s, :, π_0^A). Similarly, if the protagonist's policy is a fixed policy π_0^P, the update rule for the adversary agent, Eq. (82), converges to the optimum of J_A(s, π_0^P, :).
Poisson masks M ~ Poisson(1) provide parallel learning since
as T → ∞, so each Q table of the protagonist/adversary, Q_P^i, Q_A^i, is trained in parallel.
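As a quick numerical illustration of this claim (not taken from the source), the check below draws Poisson(1) masks for k tables over T transitions: each table sees, on average, one copy of every sample, while roughly e^{-1} of the samples are skipped, mimicking bootstrap resampling of the data stream for each table:

```python
import numpy as np

rng = np.random.default_rng(0)

T, k = 100_000, 5
masks = rng.poisson(lam=1.0, size=(T, k))   # M_t ~ Poisson(1), one entry per table
print(masks.mean(axis=0))                   # close to 1.0: each table sees ~T samples in total
print((masks[:, 0] == 0).mean())            # close to exp(-1) ~ 0.368: fraction of skipped samples
```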
First, the proof for the convergence of the iterative procedure is shown. The protagonist agent is used as an example; the proof for the adversary is similar.
Fix the policy for the adversary; then, according to Proposition 3.1 in [28], for any random variable X, the following statements are equivalent:
The above proposition is used in the following context to show that the point of convergence is the optimum of the objective function J̃_P(s, :, π_0^A).
Compared to RAQL (Algorithm 5), RA3-Q uses a multi-agent extension of the MDP, where the transition function is defined over S × A^P × A^A. The update rule Eq. (81) can be reformulated by letting:
Next, it is shown that H_P is a (1 − α(1−γ)ϵ)-contraction under Assumption A.1: for any two Q tables q, q′, define
By Assumption A.1 and the monotonicity of ũ, for given x, y ∈ ℝ, there exists ξ(x, y) ∈ [ϵ, L] such that
Hence H_P is a contraction.
By Eq. (86), 𝔼[w_t(s, a^P, a^A) | ℱ_t] = 0. Hence it remains to prove (b)(ii) of Proposition 1.
𝔼[w_t²(s, a^P, a^A) | ℱ_t] = α²·𝔼[(ũ(d_t))² | ℱ_t] − α²·(𝔼[ũ(d_t) | ℱ_t])² ≤ α²·𝔼[(ũ(d_t))² | ℱ_t]  (93)
Following the same procedure as in the proof of Theorem 1, condition (b)(ii) of Proposition 1 also holds in this case. As the learning rate satisfies condition (a), by Proposition 1, q → q*, where q* is the solution to the Bellman equation
for all (s, a^P, a^A), where s′ is sampled from P(·|s, a^P, a^A).
Similarly, it can be shown that, for a fixed policy of the protagonist agent, the update rule Eq. (82) guarantees that q_A → q_A*, where q_A* is the solution to the Bellman equation
for all (s, a^P, a^A), where s′ is sampled from P(·|s, a^P, a^A).
This does not imply a convergence guarantee for RA3-Q, because of the assumption that the protagonist's (or adversary's) policy is fixed. Only if one of the agents (e.g., the protagonist) stops learning (so that its policy becomes fixed) at some point will the other agent (the adversary) also converge. Note that in the general multi-agent learning case this is a challenge, and it is often hard to strike a balance between theoretical algorithms (with convergence guarantees) and practical algorithms (losing guarantees but obtaining good empirical results), as shown in the experimental results above.
Table 4 shows a payoff table of the rock-paper-scissors game; its corresponding directional field 800 is shown in the accompanying figures.
Another example of a two-player meta-game payoff table with three strategies is given in Table 5, with its corresponding directional field 900 shown in the accompanying figures.
Theorem 6. Consider a normal-form game with p players, where each player i chooses a strategy π^i from a set of strategies S^i = {π_1^i, . . . , π_k^i} and receives a risk-averse payoff h^i(π^1, . . . , π^p): S^1 × . . . × S^p → ℝ satisfying Assumption G.4. If x is a Nash equilibrium for the game ĥ^i(π^1, . . . , π^p), then it is a 2ϵ-Nash equilibrium for the game h^i(π^1, . . . , π^p) with probability 1 − δ if the game is played n times, where
Assumption G.4 The stochastic return h (for each player and each strategy) for each simulation has a sub-Gaussian tail, i.e., there exists ω > 0 such that
Proof. Note that we have the following relation:
Hence, if the difference |h^i(π) − ĥ^i(π)| can be controlled uniformly over players and actions, then an equilibrium for the empirical game is almost an equilibrium for the game defined by the reward function. The question is how many samples n are needed to assess that a Nash equilibrium for ĥ is a 2ϵ-Nash equilibrium for h for a fixed confidence δ and a fixed ϵ.
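As a rough illustration of the kind of sample bound involved (assuming, for simplicity, payoffs bounded in [a, b] rather than the sub-Gaussian tail of Assumption G.4, and at most k^p joint strategies), a Hoeffding plus union-bound argument gives:

```latex
\Pr\big(|\hat h^i(\pi) - h^i(\pi)| \ge \epsilon\big) \le 2\exp\!\Big(-\tfrac{2 n \epsilon^2}{(b-a)^2}\Big),
\qquad\text{so}\qquad
n \ge \frac{(b-a)^2}{2\epsilon^2}\,\log\!\Big(\frac{2\,p\,k^{p}}{\delta}\Big)
\;\Longrightarrow\;
\Pr\Big(\max_{i,\pi}\,|\hat h^i(\pi) - h^i(\pi)| \ge \epsilon\Big) \le \delta .
```

The bound stated in Theorem 6 follows the same pattern but is derived under the sub-Gaussian assumption, which is why the proof below proceeds through the moment bounds for the variance penalty term.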
In the following, in short, player i and the joint strategy π = (π^1, . . . , π^p) for the p players are fixed, and denote h^i = h^i(π), ĥ^i = ĥ^i(π). By Hoeffding's inequality,
Now, it remains to give a batch scenario for the unbiased estimator of variance penalty term. Denote
then 𝔼[V_n²] = Var[R^i] = δ², i.e., it is an unbiased estimator of the game variance. The variance of V_n² is computed first.
Let Z_j^i = R_j^i − 𝔼[R^i]; then 𝔼[Z^i] = 0 and Z_1^i, . . . , Z_n^i are independent. Then
Since Z_1^i, . . . , Z_n^i are independent, then for distinct j, k, m,
𝔼[Z_j^i Z_k^i] = 0; 𝔼[(Z_j^i)³ Z_k^i] = 0; 𝔼[(Z_j^i)² Z_k^i Z_m^i] = 0.  (108)
and denote
𝔼[(Z_j^i)² (Z_k^i)²] = μ_{22} = δ⁴; 𝔼[(Z_j^i)⁴] = μ_4.  (109)
then, with algebraic manipulations, Eq. (105) can be simplified as:
by Chebyshev's inequality,
μ_4 ≤ 16ω²·Γ(2)  (115)
by triangle inequality,
Therefore, for each joint strategy π and each player i, the following bound holds:
hence, for
there is
Plugging the result into Eq. (101), the following is obtained:
In some embodiments, another Q-learning algorithm is provided. The system may receive input data including: training epochs T; environment env; adversarial action schedule X; exploration rate ϵ; number of models k; epoch length K; recentering sample sizes {N_m}_{m≥1}; utility function parameter for the protagonist β_P < 0; and utility function parameter for the adversary β_A > 0. The training system may initialize
From t = 1 to T, for each value of t: the system chooses agent g from {A, P} according to X and selects action a_t according to
From i = 1, . . . , N, for each value of i: the system defines
where r is the reward of agent g, e.g., r^P(s, a) = r(s, a) + Σ_{i=j}^{n} γ^j·r(s_j^A, a_j^A), and the a_j^A are selected according to
where the recentering data set is a collection of N i.i.d. samples (i.e., matrices with samples for each state-action pair (s, a) from RB_P); and sets Q_1^P =
From k = 1, . . . , K, for each value of k: the system computes the stepsize λ_k = 1/(1 + (1−γ)·k) and updates:
Q_{k+1}^g ← (1 − λ_k)·Q_k^g + λ_k·[𝒯̂_k(Q_k^g) − 𝒯̂_k(Q̄^g) + 𝒯̃(Q̄^g)]
where 𝒯̂_k is the empirical Bellman operator constructed using a sample not in the recentering data set; thus the random operators 𝒯̂_k and 𝒯̃ are independent.
Then the system sets
From k=1, . . . , K for each value of k: the system computes
Then policies (πP*, πA*) are obtained:
J_P(s, Q_P*, Q_A*) = 𝔼[Σ_t γ^t·r_t^P | s, π_P*, π_A*]  (127)
i.e., for any other policy πP,
𝔼[Σ_t γ^t·r_t^P | s, π_P*, π_A*] ≥ 𝔼[Σ_t γ^t·r_t^P | s, π_P, π_A*]  (128)
and for any other policy π_A,
𝔼[Σ_t γ^t·r_t^A | s, π_P*, π_A*] ≥ 𝔼[Σ_t γ^t·r_t^A | s, π_P*, π_A]  (129)
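A minimal sketch of one epoch of such a recentered update, using the stepsize λ_k = 1/(1 + (1 − γ)k) stated above, is given below. The operator combination inside the brackets, the callable names, and the reference table Q_bar are assumptions for illustration only.

```python
import numpy as np

def recentered_q_epoch(Q, Q_bar, draw_empirical_bellman, bellman_ref, gamma=0.99, K=100):
    """One epoch of a recentered (variance-reduced) Q update sketch.

    draw_empirical_bellman : assumed callable returning a random empirical
        Bellman operator built from one fresh sample per (s, a).
    bellman_ref : Bellman backup of the reference table Q_bar precomputed
        from the N recentering samples (independent of the fresh samples).
    """
    Q = np.array(Q, dtype=float)
    for k in range(1, K + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)    # stepsize lambda_k from the text
        T_hat = draw_empirical_bellman()         # same random operator applied to both terms
        target = T_hat(Q) - T_hat(Q_bar) + bellman_ref
        Q = (1.0 - lam) * Q + lam * target
    return Q
```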
Each processor 1302 may be, for example, any type of general-purpose microprocessor or microcontroller, a digital signal processing (DSP) processor, an integrated circuit, a field programmable gate array (FPGA), a reconfigurable processor, a programmable read-only memory (PROM), or any combination thereof.
Memory 1304 may include a suitable combination of any type of computer memory that is located either internally or externally such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical memory, magneto-optical memory, erasable programmable read-only memory (EPROM), and electrically-erasable programmable read-only memory (EEPROM), Ferroelectric RAM (FRAM) or the like. Memory 1304 may store code executable at processor 1302, which causes training system to function in manners disclosed herein. Memory 1304 includes a data storage. In some embodiments, the data storage includes a secure datastore. In some embodiments, the data storage stores received data sets, such as textual data, image data, or other types of data.
Each I/O interface 1306 enables computing device 1300 to interconnect with one or more input devices, such as a keyboard, mouse, camera, touch screen and a microphone, or with one or more output devices such as a display screen and a speaker.
Each network interface 1308 enables computing device 1300 to communicate with other components, to exchange data with other components, to access and connect to network resources, to serve applications, and perform other computing applications by connecting to a network such as network (or multiple networks) capable of carrying data including the Internet, Ethernet, plain old telephone service (POTS) line, public switch telephone network (PSTN), integrated services digital network (ISDN), digital subscriber line (DSL), coaxial cable, fiber optics, satellite, mobile, wireless (e.g. Wi-Fi, WiMAX), SS7 signaling network, fixed line, local area network, wide area network, and others, including any combination of these.
The methods disclosed herein may be implemented using a system that includes multiple computing devices 1300. The computing devices 1300 may be the same or different types of devices.
Each computing device may be connected in various ways including directly coupled, indirectly coupled via a network, and distributed over a wide geographic area and connected via a network (which may be referred to as “cloud computing”).
For example, and without limitation, each computing device 1300 may be a server, network appliance, set-top box, embedded device, computer expansion module, personal computer, laptop, personal data assistant, cellular telephone, smartphone device, UMPC tablets, video display terminal, gaming console, electronic reading device, and wireless hypermedia device or any other computing device capable of being configured to carry out the methods described herein.
Embodiments performing the operations for anomaly detection and anomaly scoring provide certain advantages over manually assessing anomalies. For example, in some embodiments, all data points are assessed, which eliminates subjectivity involved in judgement-based sampling, and may provide more statistically significant results than random sampling. Further, the outputs produced by embodiments of system are reproducible and explainable.
The embodiments of the devices, systems and methods described herein may be implemented in a combination of both hardware and software. These embodiments may be implemented on programmable computers, each computer including at least one processor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication interface.
Program code is applied to input data to perform the functions described herein and to generate output information. The output information is applied to one or more output devices. In some embodiments, the communication interface may be a network communication interface. In embodiments in which elements may be combined, the communication interface may be a software communication interface, such as those for inter-process communication. In still other embodiments, there may be a combination of communication interfaces implemented as hardware, software, and combination thereof.
Throughout the foregoing discussion, numerous references will be made regarding servers, services, interfaces, portals, platforms, or other systems formed from computing devices. It should be appreciated that the use of such terms is deemed to represent one or more computing devices having at least one processor configured to execute software instructions stored on a computer readable tangible, non-transitory medium. For example, a server can include one or more computers operating as a web server, database server, or other type of computer server in a manner to fulfill described roles, responsibilities, or functions.
The foregoing discussion provides many example embodiments. Although each embodiment represents a single combination of inventive elements, other examples may include all possible combinations of the disclosed elements. Thus if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, other remaining combinations of A, B, C, or D, may also be used.
The term “connected” or “coupled to” may include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements).
The technical solution of embodiments may be in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disk read-only memory (CD-ROM), a USB flash disk, or a removable hard disk. The software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided by the embodiments.
The embodiments described herein are implemented by physical computer hardware, including computing devices, servers, receivers, transmitters, processors, memory, displays, and networks. The embodiments described herein provide useful physical machines and particularly configured computer hardware arrangements. The embodiments described herein are directed to electronic machines and methods implemented by electronic machines adapted for processing and transforming electromagnetic signals which represent various types of information. The embodiments described herein pervasively and integrally relate to machines, and their uses; and the embodiments described herein have no meaning or practical applicability outside their use with computer hardware, machines, and various hardware components. Substituting the physical hardware particularly configured to implement various acts for non-physical hardware, using mental steps for example, may substantially affect the way the embodiments work. Such computer hardware limitations are clearly essential elements of the embodiments described herein, and they cannot be omitted or substituted for mental means without having a material effect on the operation and structure of the embodiments described herein. The computer hardware is essential to implement the various embodiments described herein and is not merely used to perform steps expeditiously and in an efficient manner.
The embodiments and examples described herein are illustrative and non-limiting. Practical implementation of the features may incorporate a combination of some or all of the aspects, and features described herein should not be taken as indications of future or existing product plans. Applicant partakes in both foundational and applied research, and in some cases, the features described are developed on an exploratory basis.
Although the embodiments have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope as defined by the appended claims.
Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.
This application claims the benefit of and priority to U.S. provisional patent application No. 63/209,615, filed on Jun. 11, 2021, the entire content of which is herein incorporated by reference.
Number | Date | Country
---|---|---
63209615 | Jun 2021 | US