The present disclosure generally relates to the field of computer processing, and in particular to reinforcement learning architectures in machine learning.
Reinforcement learning can be used to train and deploy computerized agents (hereinafter simply “agents”) in trading markets; however, such applications carry fundamental challenges such as high variance and costly exploration. Moreover, markets are inherently a multi-agent domain, with many actors taking actions and changing the environment. To tackle these types of scenarios, agents need to exhibit certain characteristics such as risk-awareness, robustness to perturbations and low learning variance.
According to an aspect, there is provided a computer-implemented system for training an automated agent. The system includes a communication interface; at least one processor; memory in communication with the at least one processor; software code stored in the memory, which when executed at the at least one processor causes the system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receive, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initialize a learning table Q for the automated agent based on the plurality of states and the plurality of actions; compute a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generate an averaged learning table Q′ based on the plurality of updated learning tables.
In some embodiments, the automated agent is configured to select an action based on the averaged learning table Q′ for communicating one or more task requests.
In some embodiments, the utility function is represented by u(x)=−eβx, where β<0.
In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table {circumflex over (Q)} based on the initialized learning table Q; selecting an action at from the plurality of actions based on the interim learning table {circumflex over (Q)} and a given state st from the plurality of states; computing a reward rt and a next state st+1 based on the selected action at; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Qi of the plurality of updated learning tables based on (st,at, rt, st+1) and the utility function.
In some embodiments, the averaged learning table Q′ is computed as an element-wise average of the k updated learning tables Qi, i=1, 2, . . . , k.
In some embodiments, the utility function is a first utility function and the software code, when executed at the at least one processor, further causes the system to: instantiate an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initialize an adversarial learning table QA for the adversarial agent; compute a plurality of updated adversarial learning tables based on the initialized adversarial learning table QA using a second utility function, the second utility function comprising a monotonically increasing convex function; and generate an averaged adversarial learning table QA′ based on the plurality of updated adversarial learning tables.
In some embodiments, the adversarial agent is configured to select an adversarial action based on the averaged adversarial learning table QA′ to minimize a reward for the automated agent.
In some embodiments, the second utility function is represented by uA(x)=−eβ
In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table {circumflex over (Q)}A based on the initialized adversarial learning table QA; selecting an adversarial action atA based on the interim adversarial learning table {circumflex over (Q)}A and a given state st from the plurality of states; computing an adversarial reward rtA and a next state st+1 based on the selected adversarial action atA; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table QiA of the plurality of updated adversarial learning tables based on (st,atA,rtA,st+1) and the second utility function.
In some embodiments, the averaged adversarial learning table QA′ is computed as an element-wise average of the k updated adversarial learning tables QiA, i=1, 2, . . . , k.
According to another aspect, there is provided a computer-implemented method of training an automated agent, the method including: instantiating an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receiving, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initializing a learning table Q for the automated agent based on the plurality of states and the plurality of actions; computing a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generating an averaged learning table Q′ based on the plurality of updated learning tables.
In some embodiments, the method may further include: selecting an action, by the automated agent, based on the averaged learning table Q′ for communicating one or more task requests.
In some embodiments, the utility function is represented by u(x)=−eβx, where β<0.
In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table {circumflex over (Q)} based on the initialized learning table Q; selecting an action at from the plurality of actions based on the interim learning table {circumflex over (Q)} and a given state st from the plurality of states; computing a reward rt and a next state st+1 based on the selected action at; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Qi of the plurality of updated learning tables based on (st,at, rt, st+1) and the utility function.
In some embodiments, the averaged learning table Q′ is computed as an element-wise average of the k updated learning tables Qi, i=1, 2, . . . , k.
In some embodiments, the utility function is a first utility function and the method may further include: instantiating an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initializing an adversarial learning table QA for the adversarial agent; computing a plurality of updated adversarial learning tables based on the initialized adversarial learning table QA using a second utility function, the second utility function comprising a monotonically increasing convex function; and generating an averaged adversarial learning table QA′ based on the plurality of updated adversarial learning tables.
In some embodiments, the method may further include selecting an adversarial action by the adversarial agent based on the averaged adversarial learning table QA′ to minimize a reward for the automated agent.
In some embodiments, the second utility function is represented by uA (x)=−eβ
In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table {circumflex over (Q)}A based on the initialized adversarial learning table QA; selecting an adversarial action atA based on the interim adversarial learning table {circumflex over (Q)}A and a given state st from the plurality of states; computing an adversarial reward rtA and a next state st+1 based on the selected adversarial action atA; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table QiA of the plurality of updated adversarial learning tables based on (st,atA,rtA,st+1) and the second utility function.
In some embodiments, the averaged adversarial learning table QA′ is computed as an element-wise average of the k updated adversarial learning tables QiA, i=1, 2, . . . , k.
Other features will become apparent from the drawings in conjunction with the following description.
In the figures, embodiments are illustrated by way of example. It is to be expressly understood that the description and figures are only for the purpose of illustration and as an aid to understanding.
Embodiments will now be described, by way of example only, with reference to the attached figures, wherein in the figures:
Reinforcement learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences, in order to maximize a reward. RL has been applied to a number of fields such as games [5], navigation [4], software engineering [2], industrial design [22], and finance [18]. Each of these applications has inherent difficulties which are long-standing fundamental challenges in RL, such as limited training time, costly exploration and safety considerations, among others.
In particular, in finance, there are some examples of RL in stochastic control problems such as option pricing [19], market making [29], and optimal execution [24]. An example finance application is trading, where the system is configured to implement trained algorithms capable of automatically making trading decisions based on a set of stored rules computed by a machine [31].
In trading, the environment represents the market (and the rest of the actors). The agent's task is to take actions related to how and how much to trade, and the objective is usually to maximize profit while minimizing risk. There are several challenges in this setting such as partial observability, a large action space, a difficult definition of rewards and learning objectives [31]. In this disclosure, two properties for learning agents in realistic trading market scenarios are considered and implemented: risk assessment and robustness. In some embodiments, one or more RL algorithms are implemented with risk-averse objective functions and variance reduction techniques. In some embodiments, the RL algorithms are implemented to operate in a multi-agent learning environment, and can assume an adversary which may take over and perturb the learning process. These RL algorithms are developed to balance theoretical guarantees with practical use. Additionally, an empirical game theory analysis for multi-agent learning by considering risk-averse payoffs is performed and discussed herein.
Risk assessment is a cornerstone in financial applications. One approach is to consider risk while assessing the performance (profit) of a trading strategy. Here, risk is a quantity related to the variance (or standard deviation) of the profit and is commonly referred to as “volatility”. In particular, the Sharpe ratio [27] considers both the generated profit and the risk (variance) associated with a trading strategy. This objective function (the Sharpe ratio) is different from traditional RL, where the goal is to optimize the expected return, usually without consideration of risk. There are works that proposed risk-sensitive RL algorithms [21, 12] and variance reduction techniques [1]. The RL algorithms discussed in this disclosure improve upon these works by further reducing variance, for example, through the combination of using a utility function in updating multiple Q-tables in a Q-learning environment, while also having convergence guarantees and improved robustness via adversarial learning.
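For reference, the computation underlying this measure can be sketched as follows; the zero risk-free rate, the annualization factor of 252 trading days, and the function name are illustrative assumptions rather than values taken from this disclosure.

```python
import numpy as np

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Sharpe ratio of a series of per-period returns.

    Assumes a constant risk-free rate; the annualization factor is
    illustrative only.
    """
    excess = np.asarray(returns, dtype=float) - risk_free_rate
    std = excess.std(ddof=1)
    if std == 0:
        return 0.0
    return np.sqrt(periods_per_year) * excess.mean() / std

# Example: a short series of daily returns of a trading strategy.
print(sharpe_ratio([0.001, -0.002, 0.0015, 0.0005, -0.001]))
```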
Deep RL has been shown to be brittle in many scenarios [15]; therefore, improving robustness is important for deploying agents in realistic scenarios, such as for use in trading platforms. A line of work has improved the robustness of RL agents via adversarial perturbations [23, 26]. For example, the learning framework or system may assume an adversary (who is also learning) is allowed to take over control at regular intervals. This approach has shown good experimental results in robotics [25].
A trading market can be seen as a multi-agent interaction environment. Therefore, the agents in the RL algorithms may be evaluated from the perspective of game theory. However, it may be too difficult to analyze in a standard game-theoretic framework since there is no normal form representation (commonly used to analyze games). Fortunately, empirical game theory [35, 38] overcomes this limitation by using the information of several rounds of repeated interactions and assuming a higher level of strategies (agents' policies). These modifications have made possible the analysis of multi-agent interactions in complex scenarios such as markets [7], and multi-agent games [33]. However, these works have not studied the interactions under risk metrics (such as the Sharpe ratio), which are explored in this disclosure.
In summary, the RL algorithms disclosed, in some embodiments, combine risk-awareness, variance reduction and robustness techniques. For example, a Risk-Averse Averaged Q-Learning (e.g., RA2-Q shown in
A computer system is described next in which the various RL algorithms may be implemented to train one or more automated agents.
As detailed herein, in some embodiments, system 100 includes features adapting it to perform certain specialized purposes, e.g., to function as a trading platform. In such embodiments, system 100 may be referred to as trading platform 100 or simply as platform 100 for convenience. In such embodiments, the automated agent may generate requests for tasks to be performed in relation to securities (e.g., stocks, bonds, options or other negotiable financial instruments). For example, the automated agent may generate requests to trade (e.g., buy and/or sell) securities by way of a trading venue.
Referring now to the embodiment depicted in
A processor 104 is configured to execute machine-executable instructions to train a reinforcement learning network 110 through a training engine 112. The training engine can be configured to generate signals based on one or more rewards or incentives to train automated agents 180 to perform desired tasks more optimally, e.g., to minimize or maximize certain performance metrics such as risk or variance.
The platform 100 can connect to an interface application 130 installed on user device to receive input data. Trade entities 150a, 150b can interact with the platform to receive output data and provide input data. The trade entities 150a, 150b can have at least one computing device. The platform 100 can train one or more reinforcement learning neural networks 110. The trained reinforcement learning networks 110 can be used by platform 100 or can be for transmission to trade entities 150a, 150b, in some embodiments. The platform 100 can process trade orders using the reinforcement learning network 110 in response to commands from trade entities 150a, 150b, in some embodiments.
The platform 100 can connect to different data sources 160 and databases 170 to receive input data and receive output data for storage. The input data can represent trade orders. Network 140 (or multiple networks) is capable of carrying data and can involve wired connections, wireless connections, or a combination thereof. Network 140 may involve different network communication technologies, standards and protocols, for example.
The platform 100 can include an I/O unit 102, a processor 104, communication interface 106, and data storage 120. The I/O unit 102 can enable the platform 100 to interconnect with one or more input devices, such as a keyboard, mouse, camera, touch screen and a microphone, and/or with one or more output devices such as a display screen and a speaker.
The processor 104 can execute instructions in memory 108 to implement aspects of processes described herein. The processor 104 can execute instructions in memory 108 to configure a data collection unit, interface unit (to provide control commands to interface application 130), reinforcement learning network 110, training engine 112, and other functions described herein. The processor 104 can be, for example, any type of general-purpose microprocessor or microcontroller, a digital signal processing (DSP) processor, an integrated circuit, a field programmable gate array (FPGA), a reconfigurable processor, or any combination thereof.
As depicted in
Reinforcement learning is a category of machine learning that configures agents, such as the automated agents 180 described herein, to take actions in an environment so as to maximize a notion of reward; in some embodiments, the reward can be maximized by minimizing risks or variances. The processor 104 is configured with machine executable instructions to instantiate an automated agent 180 that maintains a reinforcement learning neural network 110 (also referred to as a reinforcement learning network 110 for convenience), and to train the reinforcement learning network 110 of the automated agent 180 using a training engine 112. The processor 104 is configured to control the reinforcement learning network 110 to process input data in order to generate output signals. Input data may include trade orders, various feedback data (e.g., rewards), feature selection data, data reflective of completed tasks (e.g., executed trades), data reflective of trading schedules, etc. Output signals may include signals for communicating resource task requests, e.g., a request to trade in a certain security. For convenience, a good signal may be referred to as a “positive reward” or simply as a reward, and a bad signal may be referred to as a “negative reward” or as a “punishment”.
Referring again to
Memory 108 may include a suitable combination of any type of computer memory that is located either internally or externally such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical memory, magneto-optical memory, erasable programmable read-only memory (EPROM), and electrically-erasable programmable read-only memory (EEPROM), Ferroelectric RAM (FRAM) or the like. Data storage devices 120 can include memory 108, databases 122, and persistent storage 124.
The communication interface 106 can enable the platform 100 to communicate with other components, to exchange data with other components, to access and connect to network resources, to serve applications, and perform other computing applications by connecting to a network (or multiple networks) capable of carrying data including the Internet, Ethernet, plain old telephone service (POTS) line, public switch telephone network (PSTN), integrated services digital network (ISDN), digital subscriber line (DSL), coaxial cable, fiber optics, satellite, mobile, wireless (e.g. Wi-Fi, WiMAX), SS7 signaling network, fixed line, local area network, wide area network, and others, including any combination of these.
The platform 100 can be operable to register and authenticate users (using a login, unique identifier, and password for example) prior to providing access to applications, a local network, network resources, other networks and network security devices. The platform 100 may serve multiple users which may operate trade entities 150a, 150b.
The data storage 120 may be configured to store information associated with or created by the components in memory 108 and may also include machine executable instructions. The data storage 120 includes a persistent storage 124 which may involve various types of storage technologies, such as solid state drives, hard disk drives, flash memory, and may be stored in various formats, such as relational databases, non-relational databases, flat files, spreadsheets, extended markup files, etc.
As shown in
In some embodiments, once the reinforcement learning network 110 has been trained, it generates output signal 188 reflective of its decisions to take particular actions in response to input data. Input data can include, for example, a set of data obtained from one or more data sources 160, which may be stored in databases 170 in real time or near real time.
As a practical example, consider an HVAC control system configured to set and control heating, ventilation, and air conditioning (HVAC) units for a building. In order to efficiently manage the power consumption of the HVAC units, the control system may receive sensor data representative of temperature data in a historical period. The control system may be implemented to use an automated agent 180 and a trained reinforcement learning network 110 to generate an output signal 188, which may be a resource request command signal 188 indicative of a set value or set point representing an optimal room temperature based on the sensor data, which may be part of input data 185 representative of the temperature data at present and in a historical period (e.g., the past 72 hours or the past week).
The input data 185 may include time series data that is gathered from sensors 160 placed at various points of the building. The measurements from the sensors 160, which form the time series data, may be discrete in nature. For example, the time series data may include a first data value of 21.5 degrees representing the detected room temperature in Celsius at time t1, a second data value of 23.3 degrees representing the detected room temperature in Celsius at time t2, a third data value of 23.6 degrees representing the detected room temperature in Celsius at time t3, and so on.
Other input data 185 may include a target range of temperature values for the particular room or space and/or a target room temperature or a target energy consumption per hour. A reward may be generated based on the target room temperature range or value, and/or the target energy consumption per hour.
In some examples, one or more automated agents 180 may be implemented, each agent 180 for controlling the room temperature for a separate room or space within the building which the HVAC control system is monitoring.
As another example, in some embodiments, a traffic control system may be configured to set and control traffic flow at an intersection. The traffic control system may receive sensor data representative of detected traffic flows at various points of time in a historical period. The traffic control system may use an automated agent 180 and a trained reinforcement learning network 110 to control a traffic light based on input data representative of the traffic flow data in real time, and/or traffic data in the historical period (e.g., the past 4 or 24 hours).
The input data 185 may include sensor data gathered from one or more data sources 160 (e.g., sensors 160) placed at one or more points close to the traffic intersection. For example, the time series data may include a first data value of 3 vehicles representing the detected number of cars at time t1, a second data value of 1 vehicle representing the detected number of cars at time t2, a third data value of 5 vehicles representing the detected number of cars at time t3, and so on.
Based on a desired traffic flow value at tn, the automated agent 180, based on neural network 110, may then generate an output signal 188 to shorten or lengthen a red or green light signal at the intersection, in order to ensure the intersection is least likely to be congested during one or more points in time.
As yet another example, the input data 185 may include a set of measured blood pressure values or blood sugar levels in a time period measured by one or more data sources such as medical devices 160. The trained reinforcement learning network 110 may receive the input data 185 from the sensors 160 or a database 170, and generate an output signal 188 representing a predicted data value, such as a future blood pressure value or a future blood sugar level. The output signal 188 representing the predicted data value may be transmitted to a health care professional for monitoring or medical purposes.
In some embodiments, as another example, an automated agent 180 in system 100 may be trained to play a video game, and more specifically, a lunar lander game 300, as shown in
In some embodiments, the reward may indicate a plurality of objectives including: smoothness of landing, conservation of fuel, time used to land, and distance to a target area on the landing pad. The reward, which may be a reward vector, can be used to train the neural network 110 for landing the lunar lander by the automated agent 180.
A Markov Decision Process (MDP) is defined by a set of states S describing the possible configurations, a set of actions A and a set of observations O for each agent. A stochastic policy πθ: S×A→[0,1] parameterized by θ produces the next state according to the state transition function T: S×A→S. The agent obtains rewards as a function of the state and agent's action r: S×A→ℝ, and receives a private observation correlated with the state o: S→O. The initial states are determined by a distribution d0: S→[0,1].
In RL, each agent i aims to maximize its own total expected return, e.g., for a Markov game with two agents, for a given initial state distribution d0, the discounted returns are respectively:
J1(d0,π1,π2)=Σt=0∞γt𝔼[rt1|π1,π2,d0] (1)
J2(d0,π1,π2)=Σt=0∞γt𝔼[rt2|π1,π2,d0] (2)
where γ is a discount factor, rt1, rt2, t=1, 2, . . . are respectively immediate rewards for agent 1 and agent 2. A Nash equilibrium for Markov game (with two agents) is defined below.
Definition 1 [16] A Nash equilibrium point of game (J1, J2) is a pair of strategies (π*1, π*2) such that for all s ∈ S,
J1(s,π*1,π*2)≥J1(s,π1,π*2) ∀π1 (3)
J2(s,π*1,π*2)≥J2(s,π*1,π2) ∀π2 (4)
Multi-Agent Extension of MDP
A Markov game for N agents is defined by a set of states S describing the possible configurations of all agents, a set of actions A1, . . . , AN and a set of observations O1, . . . , ON for each agent. To choose actions, each agent i uses a stochastic policy πθi
Q-learning can use a Q-table to guide an agent to find the best action. A Q-table can be generated based on the [state, action] pairs available to the agent, and updated with appropriate values after an action is taken by the agent during a training step or episode. This Q-table acts as a reference table for the agent to select the optimal action based on each value in the table. In multi-agent Q-learning, the Q-tables are defined over joint actions for each of the agents. Each agent receives rewards according to its reward function, with transitions dependent on the actions chosen jointly by the set of agents.
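A minimal single-agent tabular Q-learning update is sketched below to make the role of the Q-table concrete; the learning rate, discount factor, ϵ, and the toy table sizes are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Standard temporal-difference update for one (s, a, r, s') transition."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Q-table over 10 states and 4 actions, initialized to zero.
rng = np.random.default_rng(0)
Q = np.zeros((10, 4))
a = epsilon_greedy(Q, 0, 0.1, rng)
Q = q_update(Q, s=0, a=a, r=1.0, s_next=1)
```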
In some embodiments, the multi-agent behaviours in a trading market can be analyzed using empirical game theory, where a player corresponds to an agent, and a strategy corresponds to a learning algorithm. Then, in a p-player game, players are involved in a single round strategic interaction. Each player i can be configured to select a strategy πi from a set of k strategies Si={π1i, . . . , πki} and receive a stochastic payoff Ri(π1, . . . , πp), where Ri: S1×S2× . . . ×Sp→ℝ. The underlying game that is usually studied is ri(π1, . . . , πp)=𝔼[Ri(π1, . . . , πp)]. In general, the payoff of player i can be denoted as pit, and the joint strategy of all players except for player i can be denoted as x−i.
Definition 2 A joint strategy x=(x1, . . . , xp)=(xi, x−i) is a Nash equilibrium if for all i: ri(xi,x−i)≥ri(x′i,x−i) for every alternative strategy x′i of player i.
Definition 3 A joint strategy x=(x1, . . . , xp)=(xi, x−i) is an ϵ-Nash equilibrium if for all i: ri(xi,x−i)≥ri(x′i,x−i)−ϵ for every alternative strategy x′i of player i.
Evolutionary dynamics can be used to analyze multi-agent interactions. An example model is replicator dynamics (RD) [36], which describes how a population evolves through time under evolutionary pressure (in the present disclosure, a population is composed of learning algorithms). RD assumes that reproductive success is determined by interactions and their outcomes. For example, the population of a certain type increases if that type has a higher fitness (in the present disclosure, this means the expected return in a certain interaction) than the population average; otherwise, that population share will decrease.
To view the dominance of different strategies, it is common to plot the directional field of the payoff tables using the replicator dynamics for a number of strategy profiles x in the simplex strategy space [33].
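For reference, one common single-population form of the replicator dynamics is reproduced below; the payoff matrix A over the k strategies is an assumed abstraction of the meta-game payoffs rather than a formula taken from this disclosure.

\[
\dot{x}_i = x_i\left[(Ax)_i - x^{\top} A x\right], \qquad i = 1,\dots,k,
\]

where x lies on the strategy simplex, (Ax)_i is the expected fitness of strategy i against the population mix x, and x^{\top}Ax is the population-average fitness; the share of strategy i grows exactly when its fitness exceeds the average, matching the description above.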
The embodiments of RL algorithms shown in
Wainwright in [34] proposed a variance reduction Q-learning algorithm (V-QL) which can be seen as a variant of the SVRG algorithm in stochastic optimization [17]. Given an algorithm that converges to Q*, one of its iterates
In some embodiments, risk-averse objective functions [21] can be combined with the Q-learning algorithm to reduce variance and risk, as elaborated below.
Shen in [28] proposed a Q-learning algorithm that is shown to converge to the optimal of a risk-sensitive objective function. In [28], the training scheme is the same as Q-learning, except that in each iteration, a utility function is applied to a temporal difference (TD) error (see e.g., Algorithm 5 in
In order to optimize the expected return as well as minimize the variance of the expected return, an expected utility of the return can be used as the objective function instead:
By a straightforward Taylor expansion, Eq.(7) above yields:
where when β<0 the objective function is risk-averse, when β=0 the objective function is risk-neutral, and when β>0 the objective function is risk-seeking.
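To make the risk interpretation explicit, a short derivation under the exponential utility is sketched below; the second-order truncation of the expansion is an assumption made for illustration.

\[
u(x) = -e^{\beta x}, \qquad
\frac{1}{\beta}\log \mathbb{E}\!\left[e^{\beta R}\right]
\;\approx\; \mathbb{E}[R] + \frac{\beta}{2}\,\mathrm{Var}[R],
\]

so maximizing \(\mathbb{E}[u(R)] = -\mathbb{E}[e^{\beta R}]\) is approximately equivalent to maximizing \(\mathbb{E}[R] + \tfrac{\beta}{2}\mathrm{Var}[R]\): a negative β penalizes variance (risk-averse), β=0 recovers the risk-neutral expected return, and a positive β rewards variance (risk-seeking).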
By applying a monotonically increasing concave utility function u(x)=−exp(βx) where β<0 to the TD error, Algorithm 5 (see e.g.,
Theorem 1 (Theorem 3.2, [28]) Running Algorithm 5 from an initial Q table, Q→Q* with probability (w.p.) 1, where Q* is the unique solution to
∀(s, a), where s′ is sampled from [·|s,a], and the corresponding policy π* of Q* satisfies {tilde over (J)}π*≥{tilde over (J)}π ∀π.
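As a concrete illustration of this style of update (a utility applied to the TD error, with x0=−1 as stated elsewhere in this disclosure), a minimal sketch is given below; the values of β, the learning rate, and the discount factor are illustrative assumptions.

```python
import numpy as np

def risk_averse_q_update(Q, s, a, r, s_next, beta=-0.5, alpha=0.1,
                         gamma=0.99, x0=-1.0):
    """Q-learning step with a concave utility applied to the TD error.

    A concave increasing utility u(x) = -exp(beta * x), beta < 0, weighs
    negative TD errors more heavily than positive ones of the same
    magnitude, which steers the agent away from high-variance actions.
    """
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    utility = -np.exp(beta * td_error)
    Q[s, a] += alpha * (utility - x0)  # x0 = u(0) = -1 recenters the step
    return Q
```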
[16] proposed Nash-Q, a Multi-Agent Q-learning algorithm (e.g., Algorithm 6 in
In some example embodiments, a computer-implemented system for training an automated agent may include: a communication interface; at least one processor; memory in communication with the at least one processor; software code stored in the memory, which when executed at the at least one processor causes the system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receive, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initialize a learning table Q for the automated agent based on the plurality of states and the plurality of actions; compute a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generate an averaged learning table Q′ based on the plurality of updated learning tables.
In some embodiments, the automated agent is configured to select an action based on the averaged learning table Q′ for communicating one or more task requests.
In some embodiments, the utility function is represented by u(x)=−eβx, where β<0.
In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table {circumflex over (Q)} based on the initialized learning table Q; selecting an action at from the plurality of actions based on the interim learning table {circumflex over (Q)} and a given state st from the plurality of states; computing a reward rt and a next state st+1 based on the selected action at; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Qi of the plurality of updated learning tables based on (st,at, rt, st+1) and the utility function.
In some embodiments, the averaged learning table Q′ is computed as an element-wise average of the k updated learning tables Qi, i=1, 2, . . . , k.
In some embodiments, the utility function is a first utility function and the software code, when executed at the at least one processor, further causes the system to: instantiate an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initialize an adversarial learning table QA for the adversarial agent; compute a plurality of updated adversarial learning tables based on the initialized adversarial learning table QA using a second utility function, the second utility function comprising a monotonically increasing convex function; and generate an averaged adversarial learning table QA′ based on the plurality of updated adversarial learning tables.
In some embodiments, the adversarial agent is configured to select an adversarial action based on the averaged adversarial learning table QA′ to minimize a reward for the automated agent.
In some embodiments, the second utility function is represented by uA(x)=−eβ
In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table {circumflex over (Q)}A based on the initialized adversarial learning table QA; selecting an adversarial action atA based on the interim adversarial learning table {circumflex over (Q)}A and a given state st from the plurality of states; computing an adversarial reward rtA and a next state st+1 based on the selected adversarial action atA; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table QiA of the plurality of updated adversarial learning tables based on (st,atA,rtA,st+1) and the second utility function.
In some embodiments, the averaged adversarial learning table QA′ is computed as an element-wise average of the k updated adversarial learning tables QiA, i=1, 2, . . . , k.
In some example embodiments, there is a computer-implemented method of training an automated agent, the method may include: instantiating an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receiving, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initializing a learning table Q for the automated agent based on the plurality of states and the plurality of actions; computing a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generating an averaged learning table Q′ based on the plurality of updated learning tables.
In some embodiments, the method may further include: selecting an action, by the automated agent, based on the averaged learning table Q′ for communicating one or more task requests.
In some embodiments, the utility function is represented by u(x)=−eβx, β<0.
In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table {circumflex over (Q)} based on the initialized learning table Q; selecting an action at from the plurality of actions based on the interim learning table {circumflex over (Q)} and a given state st from the plurality of states; computing a reward rt and a next state st+1 based on the selected action at; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Qi of the plurality of updated learning tables based on (st,at, rt, st+1) and the utility function.
In some embodiments, the averaged learning table Q′ is computed as an element-wise average of the k updated learning tables Qi, i=1, 2, . . . , k.
In some embodiments, the method may further include: instantiating an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initializing an adversarial learning table QA for the adversarial agent; computing a plurality of updated adversarial learning tables based on the initialized adversarial learning table QA using a second utility function, the second utility function comprising a monotonically increasing convex function; and generating an averaged adversarial learning table QA′ based on the plurality of updated adversarial learning tables.
In some embodiments, the method may further include selecting an adversarial action by the adversarial agent based on the averaged adversarial learning table QA′ to minimize a reward for the automated agent.
In some embodiments, the second utility function is represented by uA(x)=−eβ
In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table {circumflex over (Q)}A based on the initialized adversarial learning table QA; selecting an adversarial action atA based on the interim adversarial learning table {circumflex over (Q)}A and a given state st from the plurality of states; computing an adversarial reward rtA and a next state st+1 based on the selected adversarial action atA; and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table QiA of the plurality of updated adversarial learning tables based on (st,atA,rtA,st+1) and the second utility function.
In some embodiments, the averaged adversarial learning table QA′ is computed as an element-wise average of the k updated adversarial learning tables QiA, i=1, 2, . . . , k.
More specifically, a Risk-Averse Averaged Q-Learning (e.g., RA2-Q shown in
Table 1 below briefly summarizes each of the four machine learning models and their respective convergence guarantees (or lack thereof).
From t=1 to T, for each value of t (“while in the t loop”): the training system may set Q=QH and compute an interim Q-table {circumflex over (Q)} by
where λP>0 is a constant; and
Next, while in the t loop, the training system may select action at according to {circumflex over (Q)} by applying ϵ-greedy strategy, execute the action and get (st,at, rt, st+1), which can be appended to the replay buffer RB=RB ∪ {(st,at, rt, st+1)}.
The training system may, while in the t loop, generate a mask M ∈ ℕk with each entry sampled from a Poisson(1) distribution, and for i=1, . . . , k, for each value of i (“while in the i loop”):
if and when Mi=1, update the learning table Qi by
where u is a utility function configured to minimize risks, and x0=−1.
In some embodiments, the utility function u may be a monotonically increasing concave function in order to minimize risks (and maximize reward) for the automated agent. For example, an example utility function u(x) can be:
If x<=0, u(x)=0.5x; or
If x>0, u(x)=0.1x.
For another example, the utility function u may be u(x)=−eβx where β<0.
Next, while in the i loop, the training system may update Ni by Ni (st,at)=Ni (st,at)+1; update learning rate
Outside of the i loop but still while in the t loop, the training system may update H by randomly sampling integers from 1 to k.
Once outside of the t loop, the training system may generate the averaged Q-learning table
With Algorithm 5, even though convergence to the optimum of the risk-sensitive objective function holds in theory with probability 1, the proof assumes visiting every state infinitely many times, whereas the actual training time is finite. The RA2-Q algorithm above can reduce the training variance further by choosing more risk-averse actions during the finite training process.
The RA2-Q algorithm trains multiple Q tables in parallel and reduces training variance by averaging multiple Q tables in the update. Moreover, in order to obtain a convergence guarantee, k Q tables are trained and updated in parallel using Eq. (9) above as the update rule. To select more stable actions, the sample variance of the k Q tables can be used as an approximation to the true variance, and then a risk-averse {circumflex over (Q)} table (e.g., an interim Q-table) can be computed. The risk-averse {circumflex over (Q)} table can then be used to select actions.
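A compact sketch of one training step in the spirit of RA2-Q, as described above, is given below. The risk-averse interim table is built here as the sample mean minus λP times the sample standard deviation of the k tables, which is an assumed instantiation of the {circumflex over (Q)} construction; the constant learning rate, the env_step helper, and the treatment of the bootstrap head are likewise illustrative assumptions.

```python
import numpy as np

def ra2q_step(Q_tables, state, env_step, u, lambda_p=1.0, gamma=0.99,
              epsilon=0.1, alpha=0.1, x0=-1.0, rng=None):
    """One illustrative RA2-Q training step over k parallel Q-tables.

    Q_tables: array of shape (k, |S|, |A|).
    env_step: callable (state, action) -> (reward, next_state).
    u:        monotonically increasing concave utility,
              e.g. lambda x: -np.exp(-0.5 * x).
    """
    rng = rng or np.random.default_rng()
    k = Q_tables.shape[0]

    # Risk-averse interim table: penalize actions whose k value estimates disagree.
    q_hat = Q_tables.mean(axis=0) - lambda_p * Q_tables.std(axis=0)

    # epsilon-greedy action selection on the interim table.
    if rng.random() < epsilon:
        action = int(rng.integers(Q_tables.shape[2]))
    else:
        action = int(np.argmax(q_hat[state]))

    reward, next_state = env_step(state, action)

    # Poisson(1) bootstrap mask decides which of the k tables see this transition.
    mask = rng.poisson(1.0, size=k)
    for i in range(k):
        if mask[i] == 1:  # update when M_i = 1, as stated in the text
            td = reward + gamma * np.max(Q_tables[i, next_state]) \
                 - Q_tables[i, state, action]
            Q_tables[i, state, action] += alpha * (u(td) - x0)
    return Q_tables, next_state
```

In a full implementation, the per-(s, a) visit counts Ni and the decaying learning rates described above would replace the constant alpha used here.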
The objective function here is Eq. (7), and it can be shown that RA2-Q algorithm (also known as Algorithm 1) also converges to the optimal.
Theorem 2 Running RA2-Q algorithm for an initial Q table, then for all i ∈ {1, . . . , k}, Qi→Q* w.p. 1, hence the returned table
where Q* is the unique solution to
for all (s, a), where s′ is sampled from [·|s, a], and the corresponding policy π* of Q* satisfies {tilde over (J)}π*≥{tilde over (J)}π ∀π.
Theorem 2 follows directly from Theorem 1 (e.g., see Discussion section below for details).
From m=1 to T for each value of m (“while in the m loop”): the training system selects an action according to
While in the m loop, from i=1, . . . , N for each value of i (“while in the i loop”): the system defines the i-th empirical Bellman operator as
where si is randomly sampled from [·|s,a]; u is the utility function; and x0=−1.
In some embodiments, the utility function u may be a monotonically increasing concave function in order to minimize risks (and maximize reward) for the automated agent. For example, an example utility function u(x) can be:
If x<=0, u(x)=0.5x; or
If x>0, u(x)=0.1x.
For another example, the utility function u may be u(x)=−eβx where β<0.
Once outside of the i loop, the system defines
where N is a collection of N i.i.d. samples (i.e., matrices with samples for each state-action pair (s,a) from RB). Define Q1=
From k=1, . . . , K for each value of k (“while in the k loop”): the system computes stepsize
and
Qk+1=(1−λk)·Qk+λk·[{circumflex over (B)}k(Qk)−{circumflex over (B)}k({overscore (Q)})+{tilde over (B)}N({overscore (Q)})]
where {circumflex over (B)}k is an empirical Bellman operator constructed using a sample not in the collection of N i.i.d. samples, so that the random operators {circumflex over (B)}k and {tilde over (B)}N are independent.
Once outside of the k loop:
[34] proposed Variance Reduced Q-learning, which trains multiple Q tables in parallel and uses the averaged Q table in the update rule. It is shown that it guarantees a convergence rate which is minimax optimal. The RA2.1-Q algorithm improves upon [34] by applying a utility function to the TD error during Q updates for the purpose of further reducing variance. To select more stable actions during training, the sample variance of the k Q tables is used as an approximation to the true variance and a risk-averse {circumflex over (Q)} table is computed. The risk-averse {circumflex over (Q)} table can be used to select actions.
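One possible reading of the recentered, utility-wrapped update is sketched below under stated assumptions: the reference table Q_bar, the Monte Carlo estimate B_bar_of_Q_bar of the Bellman operator applied to it, the single-sample empirical operator, and the constant step size are all illustrative, so this is a sketch of the variance-reduction idea rather than a verbatim restatement of Algorithm 2.

```python
import numpy as np

def empirical_bellman(Q, transitions, gamma=0.99):
    """Empirical Bellman operator using one sampled transition per (s, a).

    transitions[(s, a)] = (reward, next_state) for that state-action pair.
    """
    out = np.empty_like(Q)
    n_states, n_actions = Q.shape
    for s in range(n_states):
        for a in range(n_actions):
            r, s_next = transitions[(s, a)]
            out[s, a] = r + gamma * np.max(Q[s_next])
    return out

def ra21q_update(Q, Q_bar, B_bar_of_Q_bar, transitions, u,
                 step=0.1, x0=-1.0, gamma=0.99):
    """SVRG-style recentered update with a utility applied to the error term.

    Q_bar is a fixed reference table and B_bar_of_Q_bar an estimate of the
    Bellman operator applied to it, averaged over N independent samples
    (the variance-reduction recipe borrowed from [34]); wrapping the
    recentered error in the concave utility u is the risk-averse twist
    described in the text. The same sampled transitions are applied to Q
    and Q_bar so that their noise largely cancels in the difference.
    """
    B_k_Q = empirical_bellman(Q, transitions, gamma)
    B_k_Qbar = empirical_bellman(Q_bar, transitions, gamma)
    recentered_error = B_k_Q - B_k_Qbar + B_bar_of_Q_bar - Q
    return Q + step * (u(recentered_error) - x0)
```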
For all (s,aP,aA), the system can initialize QP(s,aP,aA)=0; QA(s,aP,aA)=0; N(s,aP,aA)=0.
From t=1 to T, for each value of t (“while in the t loop”): at state st, the system computes πP(st), πA(st), which is a mixed strategy Nash equilibrium solution of the bimatrix game (QP(st), QA(st)). The system (or the automated agent) selects an action atP based on πP(st) according to ϵ-greedy strategy and selects an adversarial action atA based on πA(st) according to ϵ-greedy strategy. The system observes and computes rtP, rtA and st+1.
While in the t loop, at state st+1, the system computes πP(st+1), πA(st+1), which are mixed strategy Nash equilibrium solutions of the bimatrix game (QP(st+1), QA(st+1)). The system updates N(st,atP,atA)=N(st,atP,atA)+1 and sets the learning rate
The system then updates QP, QA such that:
QP(st,atP,atA)=QP(st,atP,atA)+αt·[uP(rtP+γ·πP(st+1)QP(st+1)πA(st+1)−QP(st,atP,atA))−x0] (11)
where uP is a utility function and x0=−1.
QA(st,atP,atA)=QA(st,atP,atA)+αt·[uA(rtA+γ·πP(st+1)QA(st+1)πA(st+1)−QA(st,atP,atA))−x1] (12)
where uA is a utility function, x1=1.
Outside of the t loop, the system then returns (QP, QA).
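Treating the mixed-strategy Nash equilibrium vectors at the next state as given, the two updates Eq. (11) and Eq. (12) can be sketched as follows; the bimatrix-game solver itself is assumed to be external, and the helper name and hyper-parameters are illustrative. The utility functions uP and uA follow the embodiments described next.

```python
import numpy as np

def ramq_update(QP, QA, s, aP, aA, rP, rA, s_next, piP_next, piA_next,
                uP, uA, alpha, gamma=0.99, x0=-1.0, x1=1.0):
    """Protagonist and adversary Q updates following Eq. (11) and Eq. (12).

    QP, QA have shape (|S|, |A_P|, |A_A|); piP_next and piA_next are the
    mixed-strategy probability vectors at s_next, assumed to be computed
    by an external bimatrix-game solver. uP is increasing and concave,
    uA increasing and convex, as in the embodiments described below.
    """
    # Expected next-state values under the joint mixed strategies:
    # pi_P(s')^T Q(s') pi_A(s').
    vP_next = piP_next @ QP[s_next] @ piA_next
    vA_next = piP_next @ QA[s_next] @ piA_next

    tdP = rP + gamma * vP_next - QP[s, aP, aA]
    tdA = rA + gamma * vA_next - QA[s, aP, aA]

    QP[s, aP, aA] += alpha * (uP(tdP) - x0)  # Eq. (11)
    QA[s, aP, aA] += alpha * (uA(tdA) - x1)  # Eq. (12)
    return QP, QA
```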
In some embodiments, the utility function uP may be a monotonically increasing concave function in order to minimize risks (and maximize reward) for the automated agent. For example, an example utility function uP(x) can be:
If x<=0, uP(x)=0.5x; or
If x>0, uP(x)=0.1x.
For another example, the utility function uP may be uP(x)=−eβ
In some embodiments, the utility function uA may be a monotonically increasing convex function in order to maximize risks (and minimize reward). For example, the utility function may be uA(x)=eβ
In complex scenarios such as financial markets, learned RL policies can be brittle. To improve robustness, adversarial learning is incorporated into a multi-agent learning problem in the RAM-Q algorithm.
In the adversarial setting, it is assumed that there are two learning processes happening simultaneously, a main protagonist (P) and an adversary (A): the goal of the protagonist is to maximize the total return as well as to minimize the variance; the goal of the adversary is to minimize the total return of the protagonist as well as to maximize the variance. Here, one assumption is that each agent can observe its opponent's immediate reward.
Let rtP be the immediate reward received by the protagonist at step t, and let rtA be the immediate reward received by the adversary at step t. Then the objective functions may be chosen as follows:
The objective function for the protagonist is,
By a Taylor expansion, Eq. (13) yields:
Similarly, the objective function for the adversary is,
and by Taylor expansion, Eq. (14) yields,
Then the following guarantee holds:
Theorem 3 If the two-agent game ({tilde over (J)}P,{tilde over (J)}A) has a Nash equilibrium solution, then running the RAM-Q algorithm from initial Q tables QP, QA will converge to QP* and QA* w.p. 1, s.t. the Nash equilibrium solution (π*P, π*A) for the bimatrix game (QP*, QA*) is the Nash equilibrium solution to the game ({tilde over (J)}πP,{tilde over (J)}πA), and the equilibrium payoffs are {tilde over (J)}P(s,π*P,π*A), {tilde over (J)}A(s,π*P,π*A).
Although the RAM-Q algorithm gives a solid convergence guarantee, it suffers from drawbacks such as expensive computational cost and idealized assumptions; e.g., in trading markets, there may not exist a Nash equilibrium to ({tilde over (J)}P,{tilde over (J)}A), and during the training process, assumptions about the Nash equilibrium (e.g., Assumption B.3 in the Discussion section below) may break [8]. Hence, another algorithm, RA3-Q, is developed, which relaxes these assumptions (likely at the expense of losing theoretical guarantees) while enhancing robustness and performing well in practice.
The training system can initialize QPi, QAi ∀i=1, . . . , k, and initialize N=0. The system then randomly samples action-choosing head integers HP, HA ∈ {1, . . . , k}.
From t=1 to T, for each value of t (“while in the t loop”): the system sets QP=QPHP, QA=QAHA and computes interim risk-averse tables {circumflex over (Q)}P, {circumflex over (Q)}A.
Next, the system selects actions aP, aA according to {circumflex over (Q)}P, {circumflex over (Q)}A by applying ϵ-greedy strategy and generates a mask M ∈ ℕk with each entry sampled from a Poisson(1) distribution. The system updates QiP, QiA, i=1, . . . , k according to mask M using update rules Eq. (11) and Eq. (12). The system then updates HP and HA.
Once outside of the t loop, the system returns
In RA3-Q, the objective function for the protagonist agent is Eq. (13), and the objective function for the adversary agent is Eq. (14). In order to optimize {tilde over (J)}P and {tilde over (J)}A, utility functions are applied to TD errors when updating Q tables, and training multiple Q tables in parallel is used to select actions with low variance. The full version of RA3-Q is Algorithm 7 in
RA3-Q combines (i) risk-aversion using utility functions, (ii) variance reduction by maintaining multiple Q tables, and (iii) robustness via adversarial learning. Intuitively, as the adversary gets stronger, the protagonist faces harder challenges, thus enhancing robustness. Compared to RAM-Q, where the returned policy (πP,πA) is a Nash equilibrium of ({tilde over (J)}P,{tilde over (J)}A), RA3-Q does not have a convergence guarantee; however, it has several practical advantages, including computational efficiency, simplicity (e.g., no strong assumptions) and more stable actions during training. For a longer discussion see the Discussion section, below.
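For completeness, one RA3-Q training step is sketched below in compressed form. The risk-averse interim tables (sample mean minus a scaled sample standard deviation), the constant learning rate, and in particular the use of greedy one-hot strategies at the next state in place of a bimatrix Nash solution are all simplifying assumptions made so that the sketch stays self-contained; they are not prescribed by this disclosure.

```python
import numpy as np

def ra3q_step(QP_tables, QA_tables, state, env_step, uP, uA,
              lambda_p=1.0, epsilon=0.1, alpha=0.1, gamma=0.99,
              x0=-1.0, x1=1.0, rng=None):
    """One illustrative RA3-Q step with k protagonist and k adversary tables.

    QP_tables, QA_tables: arrays of shape (k, |S|, |A_P|, |A_A|).
    env_step: callable (state, aP, aA) -> (rP, rA, next_state).
    """
    rng = rng or np.random.default_rng()
    k, _, nP, nA = QP_tables.shape

    # Risk-averse interim tables built from the k bootstrapped estimates.
    qP_hat = QP_tables.mean(0) - lambda_p * QP_tables.std(0)
    qA_hat = QA_tables.mean(0) - lambda_p * QA_tables.std(0)

    # Independent epsilon-greedy selection on the interim tables.
    aP = int(rng.integers(nP)) if rng.random() < epsilon \
        else int(np.argmax(qP_hat[state].max(axis=1)))
    aA = int(rng.integers(nA)) if rng.random() < epsilon \
        else int(np.argmax(qA_hat[state].max(axis=0)))

    rP, rA, s_next = env_step(state, aP, aA)

    # Greedy one-hot strategies at the next state stand in for the Nash
    # strategies appearing in Eq. (11) and Eq. (12).
    piP = np.eye(nP)[int(np.argmax(qP_hat[s_next].max(axis=1)))]
    piA = np.eye(nA)[int(np.argmax(qA_hat[s_next].max(axis=0)))]

    mask = rng.poisson(1.0, size=k)
    for i in range(k):
        if mask[i] == 1:
            vP = piP @ QP_tables[i, s_next] @ piA
            vA = piP @ QA_tables[i, s_next] @ piA
            tdP = rP + gamma * vP - QP_tables[i, state, aP, aA]
            tdA = rA + gamma * vA - QA_tables[i, state, aP, aA]
            QP_tables[i, state, aP, aA] += alpha * (uP(tdP) - x0)
            QA_tables[i, state, aP, aA] += alpha * (uA(tdA) - x1)
    return QP_tables, QA_tables, s_next
```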
Running RA3-Q from an initial Q table,
where QP* is a solution to
∀(st,aP,aA), where st+1 is sampled from [·|st,aP,aA], and the corresponding policy π*P of QP* satisfies {tilde over (J)}π*
where QA* is a solution to
∀(st,aP,aA). Where st+1 is sampled from [·|st,aP,aA]. And the corresponding policy π*A of QA* satisfies {tilde over (J)}π*
When the environment is populated by many learning agents, empirical game theory (EGT) may be used to evaluate the performance of the agents.
In EGT, each agent is a player involved in rounds of strategic interaction (games). By meta-game analysis, the superiority of each strategy can be evaluated. A contribution of this disclosure is to theoretically prove that the Nash equilibrium of the risk-averse meta-game is an approximation of the Nash equilibrium of the population game; to the inventors' knowledge, this is the first work to perform this type of risk-averse analysis.
In EGT, the dominance of strategies can be visualized by plotting the meta-game payoff tables together with the replicator dynamics. A meta-game payoff table can be seen as a combination of two matrices (N|R), where each row Ni contains a discrete distribution of p players over k strategies, yielding a discrete profile (nπ1, . . . , nπk) indicating how many players play each strategy.
And each row Ri captures the rewards corresponding to the rows in N.
For example, for a game A with 2 players and 3 strategies {π1, π2, π3} to choose from, the meta-game payoff table can be constructed as follows: on the left side of the table, all of the possible combinations of strategies are listed. If there are p players and k strategies, then there are C(p+k−1, p) rows; hence in game A, there are C(2+3−1, 2)=6 rows.
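As a small worked check of that count, the number of distinct strategy profiles of p indistinguishable players over k strategies is the multiset coefficient C(p+k−1, p); the short snippet below, using the 2-player, 3-strategy game A as the example, is purely illustrative.

```python
from itertools import combinations_with_replacement
from math import comb

p, k = 2, 3  # players and strategies in the example game A

# Number of rows in the meta-game payoff table: C(p + k - 1, p).
print(comb(p + k - 1, p))  # -> 6

# The six discrete profiles (how many players use each of pi_1, pi_2, pi_3).
strategies = ["pi_1", "pi_2", "pi_3"]
for profile in combinations_with_replacement(strategies, p):
    counts = tuple(profile.count(s) for s in strategies)
    print(profile, "->", counts)
```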
Once a meta-game payoff table and the replicator dynamics are obtained, a directional field plot is computed where arrows in the strategy space indicate the direction of flow, or change, of the population composition over the strategies (see the Discussion section below for two examples of directional field plots in multi-agent problems).
Previously, [33] showed that for a game ri(π1, . . . , πp)=𝔼[Ri(π1, . . . , πp)], with a meta-payoff (empirical payoff) {circumflex over (r)}i(π1, . . . , πp), the Nash equilibrium of {circumflex over (r)} is an approximation of the Nash equilibrium of r.
Lemma 1 [33] If x is a Nash Equilibrium for the game {circumflex over (r)}i (π1, . . . , πp), then it is a 2ϵ-Nash equilibrium for the game ri(π1, . . . , πp), where
Lemma 1 implies that if for each player, we can bound the estimation error of empirical payoff, then we can use the Nash Equilibrium of meta game as an approximation of Nash Equilibrium of the game.
As the objective is to consider risk averse payoff to evaluate strategies, instead of
ri(π1, . . . ,πp)=𝔼[Ri(π1, . . . ,πp)],
The following equation
hi(π1, . . . ,πp)=𝔼[Ri(π1, . . . ,πp)]−β·Var[Ri(π1, . . . ,πp)]
(where β>0) is chosen as the game payoff.
Moreover, the following equation
is chosen as meta-game payoff, where
and Rji is the stochastic payoff of player i in the j-th experiment.
To the inventors' knowledge, there is no previous work on empirical game theory analysis with risk-sensitive payoffs. Below, a theoretical analysis is presented to show that for the risk-averse payoff game, the Nash equilibrium can still be approximated by a meta-game.
Theorem 4 Under Assumption G.4, for a Normal Form Game with p players, where each player i chooses a strategy πi from a set of strategies Si={π1i, . . . , πki} and receives a meta payoff hi(π1, . . . ,πp) (Eq. (15)): if x is a Nash equilibrium for the game ĥi(π1, . . . ,πp), then it is a 2ϵ-Nash equilibrium for the game hi(π1, . . . ,πp) with probability 1−δ if the game is played n times, where
The experiments are conducted using the open-sourced ABIDES [11] market simulator in a simplified setting. The environment is generated by replaying publicly available real trading data for a single stock ticker (see e.g., https://lobsterdata.com/info/DataSamples.php). The setting includes one non-learning agent that replays the market deterministically [3] and learning agents. The learning agents considered are: RAQL (i.e., Algorithm 5), RA2-Q (i.e., Algorithm 1), RA2.1-Q (i.e., Algorithm 2), and RA3-Q (i.e., Algorithm 4).
A setting similar to existing implementations in ABIDES (see e.g., https://github.com/abides-sim/abides/blob/master/agent/examples/QLearningAgent.py) is used where the state space is defined by two features: current holdings and volume imbalance. Agents take one action at every time step (every second) selecting among: buy/sell with limit price base+i·K, where i ∈ {1, 2, . . . , 6} or do nothing. The immediate reward is defined by the change in the value of the portfolio (mark-to-market) and comparing against the previous time step. The comparisons are in terms of Sharpe ratio, which is a widely used measure in trading markets.
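To make the setting concrete, a minimal sketch of the state, action, and reward definitions described above is given below; the discretization bins, the increment K, and the helper names are assumptions for illustration and do not reproduce the ABIDES agent implementation.

```python
import numpy as np

K = 0.01  # illustrative price increment; the actual offset size is not specified here

def make_state(holdings, volume_imbalance, holdings_bins, imbalance_bins):
    """Discretize the two state features: current holdings and volume imbalance."""
    return (int(np.digitize(holdings, holdings_bins)),
            int(np.digitize(volume_imbalance, imbalance_bins)))

# 13 actions: do nothing, or buy/sell with limit price base + i*K for i in 1..6.
ACTIONS = [("hold", 0)] + [(side, i) for side in ("buy", "sell") for i in range(1, 7)]

def limit_price(base_price, i):
    """Limit price of the i-th action level, base + i*K, as described in the text."""
    return base_price + i * K

def reward(portfolio_value_now, portfolio_value_prev):
    """Mark-to-market reward: change in portfolio value versus the previous step."""
    return portfolio_value_now - portfolio_value_prev
```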
Table 2 below shows the meta-payoff table of a two player-game among three strategies: RAQL, RA2-Q and RA2.1-Q. The results show that the two proposed algorithms RA2-Q and RA2.1-Q obtained better results than RAQL. With those payoffs, the directional plot 700 and the trajectory plot 710 shown in
RA2-Q and RA3-Q are also compared in terms of robustness. In this setting, both agents are trained under the same conditions as a first step. Then, in the testing phase, two types of perturbations, an adversarial agent (trained within RA3-Q) and a noise agent (i.e., zero-intelligence), are added to the environment. The results are presented in Table 3 below, in terms of Sharpe ratio using cross validation with 80 experiments.
Table 3 shows the comparison, again in terms of Sharpe ratio, with the two types of perturbations: the trained adversary from RA3-Q used at testing time, and zero-intelligence agents. It can be seen that RA3-Q obtains better results in both cases due to its enhanced robustness.
As mentioned earlier, RA2.1-Q in theory does not have a convergence guarantee; however, it obtained good empirical results (better than RAQL and RA2-Q). It is an open question whether RA2.1-Q converges to the optimum of Eq. (7). Furthermore, it may be explored whether RA2.1-Q also enjoys a minimax-optimal convergence rate up to a logarithmic factor as in [34]. Similarly, RA3-Q does not have a convergence guarantee in the multi-agent learning scenario (when the protagonist and adversary agents are learning simultaneously). However, RA3-Q obtained better empirical results than RA2-Q, highlighting its robustness. In the text below, it is shown that Eq. (81) or Eq. (82) converges to the optimum assuming the policy for the adversary (or protagonist) is fixed (thus, it is no longer a multi-agent learning setting).
In terms of the EGT analysis, the analysis uses a risk-averse measure based on variance (second moment), studying higher moments and other measures may be possible.
In this disclosure, four new Q-learning algorithms are presented that augment reinforcement learning agents with risk-awareness, variance reduction, and robustness. RA2-Q and RA2.1-Q are risk-averse but use slightly different techniques to reduce variance. RAM-Q and RA3-Q are two algorithms that extend the RL agents by adding an adversarial learning layer, which is expected to improve robustness. The theoretical analysis establishes convergence results for RA2-Q and RAM-Q; in the empirical results, RA2.1-Q and RA3-Q obtained better results in a simplified trading scenario.
where u is a utility function, u(x)=−eβx where β<0; x0=−1.
As proven in [28] (Lemma A.2), for the iterative procedure
where αt≥0 satisfy, for any (s, a), Σt=0∞αt(s, a)=∞ and Σt=0∞αt2(s, a)<∞, then Qt→Q*, where Q* is the solution of the Bellman equation
If Lemma A.2 is true, then it is shown in [28] that the corresponding policy optimizes the objective function Eq. (7).
Before proving convergence in Lemma A.2, a more general update rule is discussed:
q_{t+1}(i) = (1 − α_t(i))·q_t(i) + α_t(i)·[(H q_t)(i) + w_t(i)]  (20)
where i is the independent variable (e.g., in single-agent Q-learning, it is the state-action pair (s, a)), q_t ∈ ℝ^d, H: ℝ^d → ℝ^d is an operator, w_t denotes a random noise term, and α_t is the learning rate, with the understanding that α_t(i) = 0 if q(i) is not updated at time t. Denote by ℱ_t the history of the algorithm up to time t,
ℱ_t = {q_0(i), . . . , q_t(i), w_0(i), . . . , w_t(i), α_0(i), . . . , α_t(i)}  (21)
Recall the following essential proposition:
Proposition 1 [6]. Let q_t be the sequence generated by the iteration Eq. (20). If the following assumptions hold:
(a). The learning rates αt(i) satisfy:
α_t(i) ≥ 0; Σ_{t=0}^∞ α_t(i) = ∞; Σ_{t=0}^∞ α_t²(i) < ∞; ∀i  (22)
(b). The noise terms w_t(i) satisfy (i) 𝔼[w_t(i) | ℱ_t] = 0, and (ii) 𝔼[w_t²(i) | ℱ_t] ≤ A + B·∥q_t∥∞² for some constants A and B.
(c). The mapping H is a contraction under sup-norm.
Then qt converges to the unique solution q* of the equation Hq*=q* with probability 1.
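As an illustrative check (not taken from the source), the standard choice α_t(i) = 1/(t+1) at every step at which coordinate i is updated satisfies condition (a):

```latex
\alpha_t(i) = \frac{1}{t+1} \ge 0,
\qquad \sum_{t=0}^{\infty} \frac{1}{t+1} = \infty,
\qquad \sum_{t=0}^{\infty} \frac{1}{(t+1)^{2}} = \frac{\pi^{2}}{6} < \infty .
```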
In order to apply Proposition 1, the update rule Eq. (9) is formulated by letting
And the following is set:
where s′ is sampled from the transition kernel P(·|s, a).
More explicitly, Hq is defined as
Next, it is shown that H is a contraction under sup-norm.
The utility function is assumed to satisfy:
Assumption A.1
i. The utility function u is strictly increasing and there exists some y_0 ∈ ℝ such that u(y_0) = x_0.
ii. There exist positive constants ϵ, L such that
ϵ ≤ (u(x) − u(y)) / (x − y) ≤ L
for all x ≠ y ∈ ℝ.
Assumption A.1 appears to exclude several important types of utility functions, such as the exponential function u(x) = exp(c·x), since it does not satisfy the global Lipschitz condition. However, this can be addressed by truncating the utility when |x| is very large and by an approximation when x is very close to 0.
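A minimal sketch of such a truncation is given below; the clipping threshold x_max and the value of β are illustrative assumptions. Outside the band [−x_max, x_max] the utility is extended linearly, so its slope stays inside a fixed interval [ϵ, L], as required by Assumption A.1(ii).

```python
import numpy as np

def truncated_exp_utility(x, beta=-0.5, x_max=5.0):
    """Risk-averse exponential utility u(x) = -exp(beta*x), beta < 0, extended
    linearly outside [-x_max, x_max] so its derivative stays bounded (sketch only)."""
    x = np.asarray(x, dtype=float)
    lo, hi = -x_max, x_max
    u = -np.exp(beta * np.clip(x, lo, hi))       # exact utility inside the band
    slope_lo = -beta * np.exp(beta * lo)         # u'(lo): the largest slope (plays the role of L)
    slope_hi = -beta * np.exp(beta * hi)         # u'(hi): the smallest slope (plays the role of eps)
    u = np.where(x < lo, -np.exp(beta * lo) + slope_lo * (x - lo), u)
    u = np.where(x > hi, -np.exp(beta * hi) + slope_hi * (x - hi), u)
    return u
```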
In addition, the immediate reward rt is assumed to always satisfy:
Assumption A.2 rt is uniformly sub-Gaussian over t with variance proxy σ2, i.e.,
Proposition 2 Suppose that Assumption A.1 and Assumption A.2 hold and 0<α<min(L−1, 1). Then there exists a real number
By Assumption A.1, and the monotonicity of ũ, there exists a ξ(x,y) ∈ [ϵ, L] such that ũ(x)−ũ(y)=ξ(x,y)·(x−y). Then the following can be obtained:
Hence,
Now that it has been shown that requirements (a) and (c) of Proposition 1 hold, it remains to check (b). By Eq. (24), 𝔼[w_t(s, a) | ℱ_t] = 0, which establishes (b)(i). Next, the proof of (b)(ii) is presented.
By Assumption A.2,
where Γ(⋅) is the Gamma function (see [10] for details). The upper bound for 𝔼[|r_t|] is denoted R_1. Then 𝔼[|d_t|] ≤ R_1 + 2∥q_t∥∞, which, due to Assumption A.1, implies that
𝔼[|ũ(d_t) − ũ(0)|] ≤ 𝔼[L·|d_t|] ≤ L(R_1 + 2∥q_t∥∞)  (36)
Hence, by the triangle inequality,
𝔼[|ũ(d_t)|] ≤ |ũ(0)| + L·R_1 + 2L·∥q_t∥∞  (37)
And since
(a+b)² ≤ 2a² + 2b²  ∀ a, b ∈ ℝ  (38)
it can be shown
(|ũ(0)| + L·R_1 + 2L·∥q_t∥∞)² ≤ 2(|ũ(0)| + L·R_1)² + 8L²·∥q_t∥∞²  (39)
And since
where R_2 is the upper bound for 𝔼[r_t²] due to Assumption A.2 (𝔼[r_t²] ≤ 4σ²·Γ(1) [10]).
Note that here ũ(0)=0, therefore:
α²·𝔼[(ũ(d_t))² | ℱ_t] ≤ α²·(L·R_2 + 2L·R_1(1−γ)·∥q_t∥∞ + L(1−γ)²·∥q_t∥∞²)  (44)
hence,
𝔼[w_t²(s, a) | ℱ_t] ≤ 2α²·(L·R_2 + 2L·R_1(1−γ)·∥q_t∥∞ + L(1−γ)²·∥q_t∥∞²)  (45)
If ∥q_t∥∞ ≤ 1, then
𝔼[w_t²(s, a) | ℱ_t] ≤ 2α²·(L·R_2 + 2L·R_1(1−γ) + L(1−γ)·∥q_t∥∞²)  (46)
If ∥q_t∥∞ > 1, then
𝔼[w_t²(s, a) | ℱ_t] ≤ 2α²·(L·R_2 + (2L·R_1(1−γ) + L(1−γ)²)·∥q_t∥∞²)  (47)
It has been shown that q_t satisfies all of the requirements of Proposition 1, so q_t → q* with probability 1.
This sub-section describes the Nash-Q learning algorithm [16] and its convergence guarantees. Assumption B.3 below will also be used in the proof of RAM-Q.
From t = 1 to T, for each value of t: at state s_t, the training system computes π_A1(s_t), which is a mixed-strategy Nash equilibrium solution of the bimatrix game (Q_A1(s_t), Q_A2(s_t)). The system can select an action a_t^A based on π_A1(s_t) according to an ϵ-greedy strategy, then observe and compute r_t^A, r_t^B, a_t^B and s_{t+1}. At state s_{t+1}, the training system computes π_A1(s_{t+1}), π_A2(s_{t+1}), which are mixed-strategy Nash equilibrium solutions of the bimatrix game (Q_A1(s_{t+1}), Q_A2(s_{t+1})). The training system then updates N_A(s_t, a_t^A, a_t^B) = N_A(s_t, a_t^A, a_t^B) + 1 and sets the learning rate
The system can update QA1, QA2 such that
Q_A1(s_t, a_t^A, a_t^B) = (1 − α_t^A)·Q_A1(s_t, a_t^A, a_t^B) + α_t^A·[r_t^A + γ·π_A1(s_{t+1}) Q_A1(s_{t+1}) π_A2(s_{t+1})]
Q_A2(s_t, a_t^A, a_t^B) = (1 − α_t^A)·Q_A2(s_t, a_t^A, a_t^B) + α_t^A·[r_t^B + γ·π_A1(s_{t+1}) Q_A2(s_{t+1}) π_A2(s_{t+1})]
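A minimal sketch of this tabular Nash-Q update is shown below. The dictionary-of-matrices layout and the solve_bimatrix_nash helper (which stands in for any bimatrix Nash equilibrium solver, e.g. support enumeration) are assumptions for illustration; only the update rule itself mirrors the equations above.

```python
import numpy as np

def nash_q_update(Q1, Q2, s, aA, aB, rA, rB, s_next, alpha, gamma, solve_bimatrix_nash):
    """One tabular Nash-Q update step.

    Q1, Q2 : dict mapping state -> payoff matrix of shape (|A_A|, |A_B|).
    solve_bimatrix_nash : assumed helper returning mixed strategies (pi1, pi2)
        forming a Nash equilibrium of the bimatrix game (Q1[s'], Q2[s']).
    """
    pi1, pi2 = solve_bimatrix_nash(Q1[s_next], Q2[s_next])
    nash_value_1 = pi1 @ Q1[s_next] @ pi2   # pi_A1(s') Q_A1(s') pi_A2(s')
    nash_value_2 = pi1 @ Q2[s_next] @ pi2   # pi_A1(s') Q_A2(s') pi_A2(s')
    Q1[s][aA, aB] = (1 - alpha) * Q1[s][aA, aB] + alpha * (rA + gamma * nash_value_1)
    Q2[s][aA, aB] = (1 - alpha) * Q2[s][aA, aB] + alpha * (rB + gamma * nash_value_2)
    return Q1, Q2
```

The Nash value π_1 Q(s′) π_2 is the expected payoff of the equilibrium mixed strategies at the next state, which plays the role of the single-agent max operator in ordinary Q-learning.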
Assumption B.3 [16] A Nash equilibrium (π1(s), π2(s)) for any bimatrix game (Q1(s), Q2(s)) during the training process satisfies one of the following properties:
1. The Nash equilibrium is global optimal.
π_1(s) Q_k(s) π_2(s) ≥ π̂_1(s) Q_k(s) π̂_2(s)  ∀ π̂_1(s), π̂_2(s), and k = 1, 2  (48)
2. If the Nash equilibrium is not a global optimum, then an agent receives a higher payoff when the other agent deviates from the Nash equilibrium strategy:
π_1(s) Q_1(s) π_2(s) ≤ π_1(s) Q_1(s) π̂_2(s)  ∀ π̂_2(s)  (49)
π_1(s) Q_k(s) π_2(s) ≥ π̂_1(s) Q_k(s) π̂_2(s)  ∀ π̂_1(s)  (50)
Theorem 5 (Theorem 4 of [16]). Under Assumption B.3, the coupled sequences Q_A1, Q_A2 updated by Algorithm 6 converge to the Nash equilibrium Q values (Q*_1, Q*_2), with Q*_k (k = 1, 2) defined as
Q*_1(s, a^A, a^B) = r^A(s, a^A, a^B) + γ·𝔼[J_A(s′, π*_A, π*_B)]  (51)
Q*_2(s, a^A, a^B) = r^B(s, a^A, a^B) + γ·𝔼[J_B(s′, π*_A, π*_B)]  (52)
where (π*_A, π*_B) is a Nash equilibrium solution for this stochastic game (J_A, J_B) and
J_A(s′, π*_A, π*_B) = Σ_{t=0}^∞ γ^t·𝔼[r_t^A | π*_A, π*_B, s_0 = s′]  (53)
J_B(s′, π*_A, π*_B) = Σ_{t=0}^∞ γ^t·𝔼[r_t^B | π*_A, π*_B, s_0 = s′]  (54)
Poisson masks M ~ Poisson(1) provide parallel learning since
as T→∞, so each Q table Qi is trained in parallel. The proof of convergence of Qi for all i ∈ {1, . . . , k} is shown in the Proof of Theorem 1 above. Hence
In this section, the convergence of Algorithm 3 (RAM-Q) is proven under Assumption B.3. The convergence proof is based on the following lemma.
Lemma D.3 (Conditional Averaging Lemma [30]). Assume the learning rate α_t satisfies Proposition 1(a). Then the process Q_{t+1}(i) = (1 − α_t(i))·Q_t(i) + α_t·w_t(i) converges to 𝔼[w_t(i) | h_t, α_t], where h_t is the history at time t.
The proof of convergence of Q_P is shown as an example; the proof of convergence of Q_A is the same. First, the update rule Eq. (11) is reformulated as:
set
(H_P Q_P)(s_t, a_t^P, a_t^A) = α·u_P(r_t^P + γ·π_P(s_{t+1}) Q_P(s_{t+1}) π_A(s_{t+1}) − Q_P(s_t, a_t^P, a_t^A)) − α·x_0 + Q_P(s_t, a_t^P, a_t^A)  (57)
and HAQA is defined symmetrically as
(H_A Q_A)(s_t, a_t^P, a_t^A) = α·u_A(r_t^A + γ·π_P(s_{t+1}) Q_A(s_{t+1}) π_A(s_{t+1}) − Q_A(s_t, a_t^P, a_t^A)) − α·x_1 + Q_A(s_t, a_t^P, a_t^A)  (58)
It is shown in [16] that the operator (M_t^P, M_t^A) is a γ-contraction mapping, where (M_t^P, M_t^A) is defined as
M_t^P Q_P(s) = r_t^P + γ·π_P(s) Q_P(s) π_A(s)  (59)
M_t^A Q_A(s) = r_t^A + γ·π_P(s) Q_A(s) π_A(s)  (60)
Next, it is shown that (HP, HA) is a contraction under sup-norm (under Assumption A.1).
Similarly, ∥H_A Q_A − H_A Q̂_A∥∞ ≤ (1 − αϵ(1−γ))·∥Q_A − Q̂_A∥∞.
Hence (H_P, H_A) is a (1 − αϵ(1−γ))-contraction under sup-norm. By Lemma D.3, the update rules Eqs. (11) and (12) respectively converge to
Q_P(s_t, a_t^P, a_t^A) → 𝔼[α·u_P(r_t^P + γ·π_P(s_{t+1}) Q_P(s_{t+1}) π_A(s_{t+1}) − Q_P(s_t, a_t^P, a_t^A)) − α·x_0 + Q_P(s_t, a_t^P, a_t^A)]  (64)
Q_A(s_t, a_t^P, a_t^A) → 𝔼[α·u_A(r_t^A + γ·π_P(s_{t+1}) Q_A(s_{t+1}) π_A(s_{t+1}) − Q_A(s_t, a_t^P, a_t^A)) − α·x_1 + Q_A(s_t, a_t^P, a_t^A)]  (65)
i.e., Eqs. (11) and (12) respectively converge to Q*_P, Q*_A, where Q*_P, Q*_A are the solutions to the Bellman equations
𝔼_{s,a^P,a^A}[u_P(r^P(s, a^P, a^A) + γ·π_P*(s′) Q_P*(s′) π_A*(s′) − Q_P*(s, a^P, a^A))] = x_0  (66)
𝔼_{s,a^P,a^A}[u_A(r^A(s, a^P, a^A) + γ·π_P*(s′) Q_A*(s′) π_A*(s′) − Q_A*(s, a^P, a^A))] = x_1  (68)
where (πP*,πA*) is the Nash equilibrium solution to the bimatrix game (Q*P, Q*A).
Next, it is shown that (π_P*, π_A*) is a Nash equilibrium solution for the game with equilibrium payoffs (J̃_P(s, π_P*, π_A*), J̃_A(s, π_P*, π_A*)). As in [28], for any random variable X, define a mapping 𝔼_P(X | s, a^P, a^A) (for brevity written as 𝔼_{s,a^P,a^A}(X)) by
𝔼_{s,a^P,a^A}(X) = sup{m ∈ ℝ | 𝔼_{s,a^P,a^A}[u(X − m)] ≥ x_0}
Similar to [28, 32], suppose (π_P, π_A) is a Nash equilibrium solution to the game (J̃_P(s, π_P, π_A), J̃_A(s, π_P, π_A)); then the payoffs J̃_P(s, π_P, π_A), J̃_A(s, π_P, π_A) are the solutions to the risk-sensitive Bellman equations
J̃_P(s, π_P, π_A) = π_P(s)·𝔼_{s,a^P,a^A}(r^P(s, a^P, a^A) + γ·J̃_P(s′, π_P, π_A))·π_A(s)  (69)
J̃_A(s, π_P, π_A) = π_P(s)·𝔼_{s,a^P,a^A}(r^A(s, a^P, a^A) + γ·J̃_A(s′, π_P, π_A))·π_A(s)  (70)
And the corresponding Q tables satisfy
Q_P(s, a^P, a^A) = 𝔼_{s,a^P,a^A}(r^P(s, a^P, a^A) + γ·J̃_P(s′, π_P, π_A))  (71)
Q_A(s, a^P, a^A) = 𝔼_{s,a^P,a^A}(r^A(s, a^P, a^A) + γ·J̃_A(s′, π_P, π_A))  (72)
Note that 𝔼_{s,a^P,a^A}(·) here denotes the mapping defined above.
[28] showed that Eq. (71) is equivalent to
𝔼_{s,a^P,a^A}[u_P(r^P(s, a^P, a^A) + γ·J̃_P(s′, π_P, π_A) − Q_P(s, a^P, a^A))] = x_0  (73)
𝔼_{s,a^P,a^A}[u_A(r^A(s, a^P, a^A) + γ·J̃_A(s′, π_P, π_A) − Q_A(s, a^P, a^A))] = x_1  (74)
which, substituting Eqs. (69)-(70), can be rewritten as
𝔼_{s,a^P,a^A}[u_P(r^P(s, a^P, a^A) + γ·π_P Q_P(s′) π_A − Q_P(s, a^P, a^A))] = x_0  (75)
𝔼_{s,a^P,a^A}[u_A(r^A(s, a^P, a^A) + γ·π_P Q_A(s′) π_A − Q_A(s, a^P, a^A))] = x_1  (76)
which is exactly Eq. (66).
It has been shown that, under Assumption B.3, Eq. (69) and Eq. (66) are equivalent. Hence Algorithm 3 (RAM-Q) converges to (Q_P*, Q_A*) such that the Nash equilibrium solution (π_P*, π_A*) for the bimatrix game (Q_P*, Q_A*) is the Nash equilibrium solution to the game, and the equilibrium payoffs are J̃_P(s, π_P*, π_A*), J̃_A(s, π_P*, π_A*).
Previously, a short version of RA3-Q was presented in Algorithm 4 (e.g., as illustrated in the accompanying figures); a more detailed description follows.
The training system first initializes Q_P^i(s, a^P, a^A) = 0 and Q_A^i(s, a^P, a^A) = 0 for all i = 1, . . . , k and all (s, a^P, a^A), and sets the visit counter N = 0. The training system then randomly samples head integers H^P, H^A ∈ {1, . . . , k}.
From t = 1 to T, for each value of t (“while in the t loop”): the training system sets Q_P = Q_P^{H^P}.
The training system sets Q_A = Q_A^{H^A}.
The optimal actions (a′P, a′A) are defined as
While in the t loop, the training system selects actions a^P, a^A according to Q̂_P, Q̂_A by applying an ϵ-greedy strategy. The two agents respectively execute actions a^P, a^A and observe (s_t, a^P, a^A, r_t^A, r_t^P, s_{t+1}).
While in the t loop, the training system generates a mask M ∈ ℕ^k with entries M_i ~ Poisson(1) and updates
and N(s_t, a^P, a^A) = N(s_t, a^P, a^A) + 1.
While in the t loop, from i = 1, . . . , k, for each value of i (“while in the i loop”): if and when M_i = 1, the training system updates Q_P^i by
where u_P is a monotonically increasing concave utility function, e.g., u_P(x) = −e^{β_P·x} with β_P < 0.
While in the t loop, from i = 1, . . . , k, for each value of i (“while in the i loop”): if and when M_i = 1, the training system updates Q_A^i by:
where u_A is a monotonically increasing convex utility function, e.g., u_A(x) = e^{β_A·x} with β_A > 0.
Once outside of the i loop, the training system updates HP and HA by randomly sampling integers from 1 to k.
Once outside of the t loop, the training system returns
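To make the above loop concrete, the following is a highly simplified sketch of its structure. The environment interface (env.reset(), env.step(aP, aA) returning (s′, rP, rA)), the greedy bootstrap targets (standing in for the optimal actions (a′P, a′A) whose exact definition is given in the disclosure), and the constants x0 = u_P(0) = −1 and x1 = u_A(0) = 1 are assumptions for illustration; only the overall flow (head sampling, ϵ-greedy action selection, Poisson masks, and risk-sensitive utility updates) mirrors the steps above.

```python
import numpy as np

rng = np.random.default_rng(0)

def u_P(x, beta_P=-0.5):
    return -np.exp(beta_P * x)      # concave, increasing (risk-averse), u_P(0) = -1

def u_A(x, beta_A=0.5):
    return np.exp(beta_A * x)       # convex, increasing (risk-seeking), u_A(0) = 1

def ra3_q_sketch(env, n_states, nP, nA, T=10_000, k=5,
                 alpha=0.1, gamma=0.99, eps=0.1, x0=-1.0, x1=1.0):
    QP = np.zeros((k, n_states, nP, nA))       # k protagonist Q tables
    QA = np.zeros((k, n_states, nP, nA))       # k adversary Q tables
    HP, HA = rng.integers(k), rng.integers(k)  # sampled head indices
    s = env.reset()
    for t in range(T):
        # epsilon-greedy actions from the currently selected head tables
        if rng.random() < eps:
            aP, aA = rng.integers(nP), rng.integers(nA)
        else:
            aP = int(QP[HP, s].max(axis=1).argmax())   # best protagonist row
            aA = int(QA[HA, s].max(axis=0).argmax())   # best adversary column
        s_next, rP, rA = env.step(aP, aA)
        mask = rng.poisson(1.0, size=k)                # Poisson(1) masks
        for i in range(k):
            if mask[i] >= 1:
                dP = rP + gamma * QP[i, s_next].max() - QP[i, s, aP, aA]
                dA = rA + gamma * QA[i, s_next].max() - QA[i, s, aP, aA]
                QP[i, s, aP, aA] += alpha * (u_P(dP) - x0)   # risk-averse update
                QA[i, s, aP, aA] += alpha * (u_A(dA) - x1)   # risk-seeking update
        HP, HA = rng.integers(k), rng.integers(k)            # resample heads
        s = s_next
    return QP.mean(axis=0), QA.mean(axis=0)                  # averaged tables
```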
In this section, convergence issues of RA3-Q are discussed. First, a simplified setting is shown where, if the adversary's policy is a fixed policy π_0^A, the update rule for the protagonist agent, Eq. (81), converges to the optimum of J_P(s, :, π_0^A). Similarly, if the protagonist's policy is a fixed policy π_0^P, the update rule for the adversary agent, Eq. (82), converges to the optimum of J_A(s, π_0^P, :).
Poisson masks M ~ Poisson(1) provide parallel learning since
as T → ∞, so each Q table of the protagonist/adversary, Q_P^i, Q_A^i, is trained in parallel.
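As a quick numerical illustration of this claim (not taken from the source), the check below draws Poisson(1) masks for k tables over T transitions: each table sees, on average, one copy of every sample, while roughly e^{-1} of the samples are skipped, mimicking bootstrap resampling of the data stream for each table:

```python
import numpy as np

rng = np.random.default_rng(0)

T, k = 100_000, 5
masks = rng.poisson(lam=1.0, size=(T, k))   # M_t ~ Poisson(1), one entry per table
print(masks.mean(axis=0))                   # close to 1.0: each table sees ~T samples in total
print((masks[:, 0] == 0).mean())            # close to exp(-1) ~ 0.368: fraction of skipped samples
```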
First, the proof for the convergence of the iterative procedure is shown. The protagonist agent is used as an example; the proof for the adversary is similar.
Fix the policy for the adversary; then, according to Proposition 3.1 in [28], for any random variable X, the following statements are equivalent:
The above proposition is used in the following context to show that the point of convergence is the optimum of the objective function J̃_P(s, :, π_0^A).
Compared to RAQL (Algorithm 5), RA3-Q uses a multi-agent extension of the MDP, where the transition function is defined over S × A^P × A^A. The update rule Eq. (81) can be reformulated by letting:
Next, it is shown that H_P is a (1 − α(1−γ)ϵ)-contraction under Assumption A.1: for any two Q tables q, q′, define
By Assumption A.1 and the monotonicity of ũ, for given x, y ∈ ℝ, there exists ξ(x, y) ∈ [ϵ, L] such that
Hence H_P is a contraction.
By Eq. (86), 𝔼[w_t(s, a^P, a^A) | ℱ_t] = 0. Hence it remains to prove (b)(ii) of Proposition 1.
𝔼[w_t²(s, a^P, a^A) | ℱ_t] = α²·𝔼[(ũ(d_t))² | ℱ_t] − α²·(𝔼[ũ(d_t) | ℱ_t])² ≤ α²·𝔼[(ũ(d_t))² | ℱ_t]  (93)
Following the same procedure as in the proof of Theorem 1, condition (b)(ii) of Proposition 1 also holds in this case. As the learning rate satisfies condition (a), by Proposition 1, q → q*, where q* is the solution to the Bellman equation
for all (s, a^P, a^A), where s′ is sampled from P(·|s, a^P, a^A).
Similarly, it can be shown that, for a fixed policy of the protagonist agent, the update rule Eq. (82) guarantees that q_A → q_A*, where q_A* is the solution to the Bellman equation
for all (s, a^P, a^A), where s′ is sampled from P(·|s, a^P, a^A).
This does not imply a convergence guarantee for RA3-Q, because of the assumption that the protagonist's (or adversary's) policy is fixed. Only if one of the agents (e.g., the protagonist) stops learning (so that its policy becomes fixed) at some point will the other agent (the adversary) also converge. Note that in the general multi-agent learning case this is a challenge, and it is often hard to strike a balance between theoretical algorithms (with convergence guarantees) and practical algorithms (losing guarantees but obtaining good empirical results), as shown in the experimental results above.
Table 4 shows a payoff table of the rock-paper-scissors game; its corresponding directional field 800 is shown in the accompanying figures.
Another example of a two-player meta-game payoff table with three strategies is given in Table 5, with its corresponding directional field 900 shown in the accompanying figures.
Theorem 6. Consider a normal-form game with p players, where each player i chooses a strategy π^i from a set of strategies S^i = {π_1^i, . . . , π_k^i} and receives a risk-averse payoff h^i(π^1, . . . , π^p): S^1 × . . . × S^p → ℝ satisfying Assumption G.4. If x is a Nash equilibrium for the game ĥ^i(π^1, . . . , π^p), then it is a 2ϵ-Nash equilibrium for the game h^i(π^1, . . . , π^p) with probability 1 − δ if the game is played n times, where
Assumption G.4 The stochastic return h (for each player and each strategy) for each simulation has a sub-Gaussian tail, i.e., there exists ω > 0 such that
Proof. Note that we have the following relation:
Hence, if the difference |h^i(π) − ĥ^i(π)| can be controlled uniformly over players and actions, then an equilibrium for the empirical game is almost an equilibrium for the game defined by the reward function. The question is how many samples n are needed to assess that a Nash equilibrium for ĥ is a 2ϵ-Nash equilibrium for h for a fixed confidence δ and a fixed ϵ.
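As a rough illustration of the kind of sample bound involved (assuming, for simplicity, payoffs bounded in [a, b] rather than the sub-Gaussian tail of Assumption G.4, and at most k^p joint strategies), a Hoeffding plus union-bound argument gives:

```latex
\Pr\big(|\hat h^i(\pi) - h^i(\pi)| \ge \epsilon\big) \le 2\exp\!\Big(-\tfrac{2 n \epsilon^2}{(b-a)^2}\Big),
\qquad\text{so}\qquad
n \ge \frac{(b-a)^2}{2\epsilon^2}\,\log\!\Big(\frac{2\,p\,k^{p}}{\delta}\Big)
\;\Longrightarrow\;
\Pr\Big(\max_{i,\pi}\,|\hat h^i(\pi) - h^i(\pi)| \ge \epsilon\Big) \le \delta .
```

The bound stated in Theorem 6 follows the same pattern but is derived under the sub-Gaussian assumption, which is why the proof below proceeds through the moment bounds for the variance penalty term.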
In the following, in short, player i and the joint strategy π = (π^1, . . . , π^p) for the p players are fixed, and denote h^i = h^i(π), ĥ^i = ĥ^i(π). By Hoeffding's inequality,
Now, it remains to give a batch scenario for the unbiased estimator of variance penalty term. Denote
then 𝔼[V_n²] = Var[R^i] = δ², i.e., it is an unbiased estimator of the game variance. The variance of V_n² is computed first.
Let Z_j^i = R_j^i − 𝔼[R^i]; then 𝔼[Z^i] = 0 and Z_1^i, . . . , Z_n^i are independent. Then
Since Z_1^i, . . . , Z_n^i are independent, then for distinct j, k, m,
𝔼[Z_j^i Z_k^i] = 0; 𝔼[(Z_j^i)³ Z_k^i] = 0; 𝔼[(Z_j^i)² Z_k^i Z_m^i] = 0.  (108)
and denote
𝔼[(Z_j^i)² (Z_k^i)²] = μ_{22} = δ⁴; 𝔼[(Z_j^i)⁴] = μ_4.  (109)
then, with algebraic manipulations, Eq. (105) can be simplified as:
by Chebyshev's inequality,
μ_4 ≤ 16ω²·Γ(2)  (115)
by triangle inequality,
Therefore, for each joint strategy π and each player i, the following bound holds:
hence, for
there is
Plugging the result into Eq. (101), the following is obtained:
In some embodiments, another Q-learning algorithm is provided. The system may receive input data including: training epochs T; environment env; adversarial action schedule X; exploration rate ϵ; number of models k; epoch length K; recentering sample sizes {N_m}_{m≥1}; utility function parameter for the protagonist β_P < 0; and utility function parameter for the adversary β_A > 0. The training system may initialize
From t = 1 to T, for each value of t: the system chooses agent g from {A, P} according to X and selects action a_t according to
From i = 1, . . . , N, for each value of i: the system defines
where r is the reward of agent g, e.g., r^P(s, a) = r(s, a) + Σ_{i=j}^{n} γ^j·r(s_j^A, a_j^A), and the a_j^A are selected according to
where the recentering data set is a collection of N i.i.d. samples (i.e., matrices with samples for each state-action pair (s, a) from RB_P); and sets Q_1^P =
From k = 1, . . . , K, for each value of k: the system computes the stepsize λ_k = 1/(1 + (1−γ)·k) and updates:
Q_{k+1}^g ← (1 − λ_k)·Q_k^g + λ_k·[𝒯̂_k(Q_k^g) − 𝒯̂_k(Q̄^g) + 𝒯̃(Q̄^g)]
where 𝒯̂_k is the empirical Bellman operator constructed using a sample not in the recentering data set; thus the random operators 𝒯̂_k and 𝒯̃ are independent.
Then the system sets
From k=1, . . . , K for each value of k: the system computes
Then policies (πP*, πA*) are obtained:
J_P(s, Q_P*, Q_A*) = 𝔼[Σ_t γ^t·r_t^P | s, π_P*, π_A*]  (127)
i.e., for any other policy πP,
𝔼[Σ_t γ^t·r_t^P | s, π_P*, π_A*] ≥ 𝔼[Σ_t γ^t·r_t^P | s, π_P, π_A*]  (128)
and for any other policy π_A,
𝔼[Σ_t γ^t·r_t^A | s, π_P*, π_A*] ≥ 𝔼[Σ_t γ^t·r_t^A | s, π_P*, π_A]  (129)
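A minimal sketch of one epoch of such a recentered update, using the stepsize λ_k = 1/(1 + (1 − γ)k) stated above, is given below. The operator combination inside the brackets, the callable names, and the reference table Q_bar are assumptions for illustration only.

```python
import numpy as np

def recentered_q_epoch(Q, Q_bar, draw_empirical_bellman, bellman_ref, gamma=0.99, K=100):
    """One epoch of a recentered (variance-reduced) Q update sketch.

    draw_empirical_bellman : assumed callable returning a random empirical
        Bellman operator built from one fresh sample per (s, a).
    bellman_ref : Bellman backup of the reference table Q_bar precomputed
        from the N recentering samples (independent of the fresh samples).
    """
    Q = np.array(Q, dtype=float)
    for k in range(1, K + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)    # stepsize lambda_k from the text
        T_hat = draw_empirical_bellman()         # same random operator applied to both terms
        target = T_hat(Q) - T_hat(Q_bar) + bellman_ref
        Q = (1.0 - lam) * Q + lam * target
    return Q
```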
Each processor 1302 may be, for example, any type of general-purpose microprocessor or microcontroller, a digital signal processing (DSP) processor, an integrated circuit, a field programmable gate array (FPGA), a reconfigurable processor, a programmable read-only memory (PROM), or any combination thereof.
Memory 1304 may include a suitable combination of any type of computer memory that is located either internally or externally such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical memory, magneto-optical memory, erasable programmable read-only memory (EPROM), and electrically-erasable programmable read-only memory (EEPROM), Ferroelectric RAM (FRAM) or the like. Memory 1304 may store code executable at processor 1302, which causes training system to function in manners disclosed herein. Memory 1304 includes a data storage. In some embodiments, the data storage includes a secure datastore. In some embodiments, the data storage stores received data sets, such as textual data, image data, or other types of data.
Each I/O interface 1306 enables computing device 1300 to interconnect with one or more input devices, such as a keyboard, mouse, camera, touch screen and a microphone, or with one or more output devices such as a display screen and a speaker.
Each network interface 1308 enables computing device 1300 to communicate with other components, to exchange data with other components, to access and connect to network resources, to serve applications, and perform other computing applications by connecting to a network such as network (or multiple networks) capable of carrying data including the Internet, Ethernet, plain old telephone service (POTS) line, public switch telephone network (PSTN), integrated services digital network (ISDN), digital subscriber line (DSL), coaxial cable, fiber optics, satellite, mobile, wireless (e.g. Wi-Fi, WiMAX), SS7 signaling network, fixed line, local area network, wide area network, and others, including any combination of these.
The methods disclosed herein may be implemented using a system that includes multiple computing devices 1300. The computing devices 1300 may be the same or different types of devices.
Each computing device may be connected in various ways including directly coupled, indirectly coupled via a network, and distributed over a wide geographic area and connected via a network (which may be referred to as “cloud computing”).
For example, and without limitation, each computing device 1300 may be a server, network appliance, set-top box, embedded device, computer expansion module, personal computer, laptop, personal data assistant, cellular telephone, smartphone device, UMPC tablets, video display terminal, gaming console, electronic reading device, and wireless hypermedia device or any other computing device capable of being configured to carry out the methods described herein.
Embodiments performing the operations for anomaly detection and anomaly scoring provide certain advantages over manually assessing anomalies. For example, in some embodiments, all data points are assessed, which eliminates subjectivity involved in judgement-based sampling, and may provide more statistically significant results than random sampling. Further, the outputs produced by embodiments of system are reproducible and explainable.
The embodiments of the devices, systems and methods described herein may be implemented in a combination of both hardware and software. These embodiments may be implemented on programmable computers, each computer including at least one processor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication interface.
Program code is applied to input data to perform the functions described herein and to generate output information. The output information is applied to one or more output devices. In some embodiments, the communication interface may be a network communication interface. In embodiments in which elements may be combined, the communication interface may be a software communication interface, such as those for inter-process communication. In still other embodiments, there may be a combination of communication interfaces implemented as hardware, software, and combination thereof.
Throughout the foregoing discussion, numerous references will be made regarding servers, services, interfaces, portals, platforms, or other systems formed from computing devices. It should be appreciated that the use of such terms is deemed to represent one or more computing devices having at least one processor configured to execute software instructions stored on a computer readable tangible, non-transitory medium. For example, a server can include one or more computers operating as a web server, database server, or other type of computer server in a manner to fulfill described roles, responsibilities, or functions.
The foregoing discussion provides many example embodiments. Although each embodiment represents a single combination of inventive elements, other examples may include all possible combinations of the disclosed elements. Thus if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, other remaining combinations of A, B, C, or D, may also be used.
The term “connected” or “coupled to” may include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements).
The technical solution of embodiments may be in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disk read-only memory (CD-ROM), a USB flash disk, or a removable hard disk. The software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided by the embodiments.
The embodiments described herein are implemented by physical computer hardware, including computing devices, servers, receivers, transmitters, processors, memory, displays, and networks. The embodiments described herein provide useful physical machines and particularly configured computer hardware arrangements. The embodiments described herein are directed to electronic machines and methods implemented by electronic machines adapted for processing and transforming electromagnetic signals which represent various types of information. The embodiments described herein pervasively and integrally relate to machines, and their uses; and the embodiments described herein have no meaning or practical applicability outside their use with computer hardware, machines, and various hardware components. Substituting the physical hardware particularly configured to implement various acts for non-physical hardware, using mental steps for example, may substantially affect the way the embodiments work. Such computer hardware limitations are clearly essential elements of the embodiments described herein, and they cannot be omitted or substituted for mental means without having a material effect on the operation and structure of the embodiments described herein. The computer hardware is essential to implement the various embodiments described herein and is not merely used to perform steps expeditiously and in an efficient manner.
The embodiments and examples described herein are illustrative and non-limiting. Practical implementation of the features may incorporate a combination of some or all of the aspects, and features described herein should not be taken as indications of future or existing product plans. Applicant partakes in both foundational and applied research, and in some cases, the features described are developed on an exploratory basis.
Although the embodiments have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope as defined by the appended claims.
Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.
This application claims the benefit of and priority to U.S. provisional patent application No. 63/209,615, filed on Jun. 11, 2021, the entire content of which is herein incorporated by reference.
Number | Date | Country
---|---|---
63209615 | Jun 2021 | US