The embodiments are directed to reinforcement learning frameworks, and more specifically to a rational inattention reinforcement learning framework.
Multi-agent reinforcement learning (MARL) has shown great utility in complex agent-based simulations in economics, games, and other fields. In such simulations, the behavioral rules of agents may be too difficult for designers to specify. Instead, when using MARL, designers specify objective functions for the agents and use reinforcement learning (RL) to learn agent behaviors that optimize the specified objectives. This approach may be problematic when simulating systems of human agents because RL agents behave rationally and execute the objective-maximizing behavior, in contrast to established models of human decision-making. For instance, behavioral economics has found that humans are often irrational due to various cognitive biases and limitations. Additionally, models of irrationality yield results and implications that are significantly different from the results obtained under rationality assumptions. Therefore, human irrationality should be accounted for when simulating systems involving human(-like) agents.
In one or more implementations, not all of the depicted components in each figure may be required, and one or more implementations may include additional components not shown in a figure. Variations in the arrangement and type of the components may be made without departing from the scope of the subject disclosure. Additional components, different components, or fewer components may be utilized within the scope of the subject disclosure.
The embodiments are directed to a rational inattention reinforcement learning (RIRL) framework. The RIRL framework may be a MARL framework with agents that may be rationally inattentive. The rational inattention (RI) model is a model of bounded rationality. The RI model attributes human irrationality to the costliness of the mental effort (e.g., attention) required to identify the optimal action. Mathematically, the RI model measures these costs as the mutual information (MI) between the variables considered by the decision process and the decisions ultimately made. This captures the intuition that a policy has a higher cognitive cost if its execution requires more information about the state of the world and thus more attention. The RIRL framework may rationalize seemingly sub-optimal behavior by including the cognitive cost in the reward function, i.e., by adding the MI cost(s). In this way, the “rational” behaviors of the RIRL agent may mimic human-like bounded rationality.
The RIRL framework is a tool for modeling boundedly rational (sub-optimal) behavior and its emergent consequences in multi-agent systems. In some embodiments, the RIRL framework may model classical economic settings intractable under the conventional frameworks.
The RIRL framework extends the single-timestep framework which decomposes decision-making into two steps: stochastic perception followed by stochastic action. The stochastic perception and the stochastic action are each subject to their own MI cost. The RIRL framework generalizes the stochastic perception to multiple information channels with heterogeneous costs, hidden-state policy models, and sequential environments. The RIRL framework also provides a general-purpose technique to compute MI-based rewards and a novel boundedly-rational policy class. This allows the RIRL framework to model settings with rich cognitive cost structures, e.g., when information about some state variables may be more difficult to observe than others. For example, when applying for a job, a job candidate's past job performance may be more relevant than her overall employment history but also harder for a hiring manager to evaluate.
The RIRL framework may analyze complex scenarios that conventional frameworks may not. For instance, the RIRL framework may study a principal-agent (PA) problem, where a principal and agents are computing entities that simulate human behavior and where the principal is boundedly rational. In the PA problem, a principal employs one or more agents, but the parties have misaligned incentives and/or asymmetric information. For example, a profit-maximizing employer (e.g., the principal) must consider the best compensation scheme for motivating a (team of) employee(s) (e.g., agents) to work. However, it is difficult for the employer to determine how much and what work the employee(s) actually do(es).
Real-world PA experiments have shown that bounded rationality is key to explaining marked deviations between equilibria reached by human participants and theoretical predictions reached by computer simulations under rationality assumptions. This is partially because theoretical analyses of PA problems often rely on stylized modeling assumptions, e.g., rationality or linearity, to be tractable. Additionally, the RIRL framework enables more flexible and natural models of information asymmetry. Rather than assuming certain information is not available, the principal is allowed to (implicitly) choose how much information to observe and pay a cost to do so.
The RIRL framework may analyze generalized PA problems that are analytically intractable, such as a sequential PA problem with multiple computing agents, using heterogeneous information channels. Across all settings, the equilibrium behavior depends strongly on the cost of attention and differs from the behavior under rational assumptions. Depending on the channel, increasing the principal's inattention may either increase the agents' welfare due to increased compensation or decrease the agents' welfare due to encouraging additional work. Further, the RIRL framework reveals agents implementing strategies that may be referred to as signaling, such as agents learning to misrepresent their ability. The RIRL framework may thus be a tool to model boundedly rational (sub-optimal) behavior and analyze its emergent consequences in multi-agent systems.
Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities.
In some examples, memory 120 may include a non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the methods described in further detail herein. In some embodiments, memory 120 stores a rational inattention reinforcement learning (RIRL) framework 130 discussed above. The RIRL framework 130 may be a neural network or a combination of neural networks trained to simulate emerging behavior of computing actors 135 in computing simulations. The computing actors 135 are actors that emulate human behavior, such as a principal and agent(s) in the principal-agent problem. The computing simulations may be simulations associated with a simulated real-world, economic behavior, or gaming environment. Unlike conventional frameworks, the RIRL framework 130 is trained to incorporate a human irrationality component, also referred to as rational inattention (RI), into the actors' behavior. In other words, the RIRL framework 130 simulates the actors 135 thinking irrationally to account for the irrationality component of human behavior. The RIRL framework 130 models human irrationality as the cost of cognitive information processing, measured using the mutual information between the actors' observations and actions in the environment.
In some embodiments, the RIRL framework 130 receives observations 140 as input. The observations 140 may be observations associated with a particular environment and may be made by actors 135. Observations 140 may be observed by one, all, or a subset of actors 135, and RIRL framework 130 may be trained to determine actions from observations of different actors. Based on observations 140, the RIRL framework 130 may generate actions 150 that may be taken by actors 135. The RIRL framework 130 may also determine rewards for taking the actions 150.
While conventional reinforcement learning frameworks may be used to discover approximate utility-maximizing policies, simulations built from such “rational” behavior fail to account for the characteristic irrational behavior of human decision makers. Behavioral economic models may account for such patterns as consequences of inattention. The RIRL framework 130 uses rational inattention to formalize this intuition using a modified objective. The modified objective includes a cost to the mutual information Ĩ(ai; oi) between the (observable) state of the world oi and the actions ai˜πi(⋅|oi). This definition captures the intuition that if the agent puts in more effort to pay attention to observation oi, its action ai likely becomes more correlated with the observation oi, thus resulting in high mutual information (MI).
In the one-step setting, the optimal rationally inattentive policy πi† may be given by:
Note that this is equivalent to learning the optimal policy for adjusted reward function ri†, where reward may be as follows:
ri†(st,at)=Ui(st,at)−λĨ(ai,t;oi,t),  (2)
The mutual information Ĩ(ai,t; oi,t)=log p(ai,t, oi,t)−log [p(ai,t)p(oi,t)] is a Monte Carlo (MC) estimate of I(ai; oi), and λ is the utility cost per bit of information. The terms p(ai, oi), p(ai), and p(oi) are the joint and marginal distributions over ai and oi (i.e., the observations and associated actions for actor i) induced by the environment and the set of actors' policies π.
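For purposes of illustration only, a minimal sketch of how the per-sample adjusted reward of equation (2) might be computed is shown below. The sketch assumes Python and that log-density estimates for the joint and marginal distributions (e.g., from learned estimators) are already available; the function and argument names are hypothetical and are not part of the RIRL framework itself.

```python
def adjusted_reward(utility, log_p_joint, log_p_a, log_p_o, lam):
    """Per-sample RI-adjusted reward r† = U(s, a) − λ·Ĩ(a; o), per equation (2).

    log_p_joint: estimate of log p(a, o) under the induced joint distribution.
    log_p_a, log_p_o: estimates of the log marginals log p(a) and log p(o).
    lam: utility cost per unit of information (λ).
    """
    mi_sample = log_p_joint - (log_p_a + log_p_o)  # Monte Carlo estimate Ĩ(a; o)
    return utility - lam * mi_sample


# Example: an action that closely tracks the observation has a larger MI term,
# which lowers the adjusted reward for the same raw utility.
print(adjusted_reward(utility=1.0, log_p_joint=-1.0, log_p_a=-1.5, log_p_o=-1.2, lam=0.5))
```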
To estimate the mutual information, the RIRL framework 130 may utilize a general-purpose module for estimating Ĩπ
As discussed above, the RIRL framework 130 may decouple an action from a perception using multiple information channels. By penalizing I(ai; oi), the RIRL framework 130 models the intuition that information about, e.g., the observable state of the world is costly to obtain or use. To support richer modeling, the RIRL framework 130 may comprise multiple channels of information with heterogeneous cognitive costs. For example, when purchasing a used car, information about the prices of available cars is much easier to come by than information about their conditions. The RIRL framework 130 may model the prices of available cars and their conditions as separate information channels associated with different cognitive costs.
To that end, the RIRL framework 130 extends the action-perception decoupling strategy, which models a policy π(a|s) using a stochastic perception module q(y|s) followed by an action module p(a|y), jointly trained to optimize an RI-style reward r†(s, a)=r(s, a)−λqIq(y; s)−λpIp(a; y). The RIRL framework 130 may have a policy class that can flexibly model scenarios where different information channels (i.e., partitions of o) have different processing costs λm. The RIRL framework 130 may also use recurrent policies which may allocate processing costs strategically over time.
In some embodiments, RIRL framework 130 may decompose a given actor observation ot into a set of M≥1 observations ot={ot1, . . . , otM}, with otm being an observation from information channel m. As illustrated in
The RIRL framework 130 may include encoders 208, such as encoders 208A-M. There may be a configurable number of encoders 208A-M. The RIRL framework 130 assumes that each channel has an associated information cost, given as Δ={λ1, . . . , λM} for encoders 208. Each information channel may be one of encoders 208A-M. Encoders 208A-M may be stochastic encoders. For each channel, the RIRL framework 130 is trained on a separate encoder ƒm(ytm|otm, ψt) (encoders 208A-M), which receives otm and recurrent state ψt (shown as 206) of a long short-term memory (LSTM) 218 (discussed below) as inputs and outputs encodings 212A-M. Encodings 212A-M may be parameters, such as means and standard deviations, of a stochastic encoding ytm. The encoders ƒm are illustrated in
μtm,σtm=ƒm(otm,ψt) (3)
ytm=otm+μtm+σtm·ϵtm  (4)
where ϵtm is a random sample from a spherical Gaussian with dimensionality equal to that of ym and om.
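For illustration, a minimal sketch of one channel encoder implementing equations (3) and (4) is shown below, assuming a PyTorch implementation in which ƒm is a small multi-layer perceptron and the recurrent state ψt is provided as a flat tensor. The class name, layer sizes, and activation are illustrative assumptions rather than the framework's actual implementation.

```python
import torch
import torch.nn as nn


class ChannelEncoder(nn.Module):
    """Stochastic encoder f_m for one information channel, per equations (3)-(4)."""

    def __init__(self, obs_dim: int, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + state_dim, hidden_dim), nn.Tanh())
        self.mu_head = nn.Linear(hidden_dim, obs_dim)         # outputs μ_t^m
        self.log_sigma_head = nn.Linear(hidden_dim, obs_dim)  # outputs log σ_t^m

    def forward(self, o_m: torch.Tensor, psi: torch.Tensor) -> torch.Tensor:
        h = self.net(torch.cat([o_m, psi], dim=-1))
        mu, sigma = self.mu_head(h), self.log_sigma_head(h).exp()
        eps = torch.randn_like(o_m)        # ε_t^m sampled from a spherical Gaussian
        return o_m + mu + sigma * eps      # y_t^m = o_t^m + μ_t^m + σ_t^m·ε_t^m
```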
In some embodiments, encoders ƒm (encoders 208A-M) may include discriminators 210A-M. For example, encoder 208A may include discriminator 210A, encoder 208B may include discriminator 210B, etc. During the training phase, for each one of encoders ƒm (208A-M), a corresponding discriminator dƒm(ytm, [otm, ψt]) (shown as discriminators 210A-210M) is trained to estimate mutual information Ĩƒm(ytm; [otm, ψt]) (shown as 214A-214M). The estimated Ĩƒm(ytm; [otm, ψt]) is the cost of the mutual information associated with encoders 208A-M.
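The particular estimator used by discriminators 210A-M is not detailed here; one common way such a discriminator may be realized is as a binary classifier trained to distinguish jointly sampled pairs from pairs with shuffled encodings, whose logit then approximates the per-sample log density ratio and hence the MI contribution. A sketch of this density-ratio approach is shown below under that assumption; all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MIDiscriminator(nn.Module):
    """Classifier whose logit approximates log p(y, x) − log p(y)p(x), for x = [o_m, ψ_t]."""

    def __init__(self, y_dim: int, x_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(y_dim + x_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, y, x):
        return self.net(torch.cat([y, x], dim=-1)).squeeze(-1)


def discriminator_loss(disc, y, x):
    """Train to separate joint pairs (y, x) from shuffled pairs that mimic p(y)p(x)."""
    joint_logits = disc(y, x)
    marginal_logits = disc(y[torch.randperm(y.shape[0])], x)
    return (F.binary_cross_entropy_with_logits(joint_logits, torch.ones_like(joint_logits))
            + F.binary_cross_entropy_with_logits(marginal_logits, torch.zeros_like(marginal_logits)))


def mi_estimate(disc, y, x):
    """Per-sample MI estimate Ĩ(y; x): the trained classifier's logit is the log density ratio."""
    with torch.no_grad():
        return disc(y, x)
```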
In some embodiments, encodings 212A-M may be concatenated using a concatenation module 215 into a full encoding 216. Encoding 216 may be represented as encoding yt=[yt1, . . . , ytM] of ot observations concatenated across all M encoder samples (encodings 212A-M) generated by encoders 208A-M.
LSTM 218 may receive the full encodings 216 and update the internal state ψt of the LSTM 218 with full encodings 216. In other words, LSTM 218 may maintain a history of encoded information: ψt+1=LSTM(yt, ψt) (shown as 220). The previous or non-updated state ψt (shown as 206) of LSTM 218 may be propagated as input to encoders 208A-M as discussed above.
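A minimal sketch of the concatenation and recurrent-state update is shown below, again assuming PyTorch. In this sketch the recurrent state ψ is carried as the LSTM cell's (hidden, cell) pair, and only the hidden component would be fed back to encoders 208A-M; this representation choice is an assumption of the sketch.

```python
import torch
import torch.nn as nn


class PerceptionCore(nn.Module):
    """Concatenates per-channel encodings and updates the recurrent state:
    y_t = [y_t^1, ..., y_t^M] and ψ_{t+1} = LSTM(y_t, ψ_t)."""

    def __init__(self, enc_dims, state_dim: int):
        super().__init__()
        self.cell = nn.LSTMCell(sum(enc_dims), state_dim)

    def forward(self, encodings, psi):
        y_t = torch.cat(encodings, dim=-1)   # full encoding 216
        h, c = self.cell(y_t, psi)           # ψ_{t+1} = LSTM(y_t, ψ_t), shown as 220
        return y_t, (h, c)
```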
The observation decomposition module 202, encoders 208A-M, concatenation module 215 and LSTM 218 may be components of the stochastic perception module, discussed above.
Full encodings yt (216) and updated LSTM state ψt+1 (220) are inputs to a stochastic action module ω(at|yt, ψt+1) (shown as an action module 222). Action module 222 may output a probability distribution over actions 223, from which actions 150 may be selected. In some embodiments, action module 222 may be a neural network, such as a multi-layer perceptron network. Action module 222 may also include a discriminator 224 that generates an estimate Ĩω(at; [yt, ψt+1]). The estimated Ĩω(at; [yt, ψt+1]) is a cost of mutual information 226 of the action module 222.
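For illustration, a sketch of an action module producing a distribution over a discrete action space is shown below; the discrete (categorical) action space, class name, and layer sizes are assumptions of the sketch, and a continuous action space would instead use, e.g., a Gaussian output head.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class ActionModule(nn.Module):
    """Stochastic action module ω(a_t | y_t, ψ_{t+1}) as a small multi-layer perceptron."""

    def __init__(self, enc_dim: int, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enc_dim + state_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, n_actions)
        )

    def forward(self, y_t, psi_next):
        logits = self.net(torch.cat([y_t, psi_next], dim=-1))
        return Categorical(logits=logits)  # probability distribution over actions 223


# Usage sketch: dist = action_module(y_t, psi_next); a_t = dist.sample();
# dist.log_prob(a_t) supplies the log ω(a_t|·) term of equation (6).
```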
During the training phase, the RIRL framework 130 may be trained with policy gradients, as shown below:
Δπ∝𝔼(∇ log π(yt1, . . . ,ytM,at|st,ψt,ψt+1)rt†),  (5)
log π(yt1, . . . ,ytM,at|st,ψt,ψt+1)=log ω(at|yt,ψt+1)+Σm=1M log ƒm(ytm|otm,ψt),  (6)
rt†=U(st,at)−λωĨω(at;[yt,ψt+1])−Σm=1MλmĨƒm(ytm;[otm,ψt]),  (7)
Notably, the reward rt† generated by the RIRL framework 130 takes into account the cost of information Ĩƒm(ytm; [otm, ψt]) at each of the M information channels, in addition to the cost of information Ĩω(at; [yt, ψt+1]) at the action module 222, each weighted by its corresponding cost λm or λω.
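A minimal sketch of a training step corresponding to equations (5)-(7) is shown below, assuming a plain REINFORCE-style estimator without baselines or discounting; the MI estimates are treated as part of the (non-differentiated) reward, and all names are illustrative rather than part of the framework's actual implementation.

```python
import torch


def rirl_policy_loss(log_probs, utilities, mi_action, mi_channels, lam_action, lam_channels):
    """Surrogate loss for equations (5)-(7).

    log_probs:   log π = log ω(a_t|·) + Σ_m log f_m(y_t^m|·) per step, shape [T]
    utilities:   U(s_t, a_t) per step, shape [T]
    mi_action:   Ĩ_ω(a_t; [y_t, ψ_{t+1}]) per step, shape [T]
    mi_channels: list of per-channel Ĩ_{f_m}(y_t^m; [o_t^m, ψ_t]), each of shape [T]
    """
    r_adj = utilities - lam_action * mi_action
    for lam_m, mi_m in zip(lam_channels, mi_channels):
        r_adj = r_adj - lam_m * mi_m                    # r_t† of equation (7)
    # Ascend E[∇ log π · r†] (equation (5)) by minimizing the negative surrogate.
    return -(log_probs * r_adj.detach()).mean()
```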
At process 302, at least one observation is decomposed into a set of observations. For example, RIRL framework 130 may receive and decompose observation 140 into a set of observations 204A-M.
At process 304, the set of observations is passed through multiple information channels. As discussed above, the RIRL framework 130 includes a stochastic perception module with multiple information channels having heterogeneous costs. The information channels may be modeled as encoders 208A-M. The set of observations 204A-M, along with the recurrent state 206 of LSTM 218, is passed through the corresponding encoders 208A-M to generate encodings 212A-M.
At process 306, a cost of mutual information (MI) at encoders of a stochastic perception module is measured. As discussed above, encoders 208A-M include corresponding discriminators 210A-M. As the set of observations 204A-M is passed through encoders 208A-M, which are the multiple information channels having heterogeneous costs, discriminators 210A-M estimate the cost of MI 214A-M. In some embodiments, process 306 may be performed in parallel with process 304.
At process 308, encodings 212A-M are concatenated. For example, RIRL framework 130 may concatenate encodings 212A-M into concatenated encodings 216. Concatenated encodings 216 may be in the form of a vector.
At process 310, a concatenated encoding is stored in an LSTM. For example, RIRL framework 130 may store the concatenated encodings 216 in the LSTM 218. LSTM 218 may maintain a history of encoded information. That is, LSTM 218 may include an internal state ψt (206) that is updated with the concatenated encodings 216. The updated internal state of the LSTM is internal state ψt+1 (220).
At process 312, a distribution of actions is generated. Action module 222 of the RIRL framework 130 may receive the concatenated encodings 216 and a history of encoded information as internal LSTM state ψt+1 (220), and generate a distribution of actions from which actions 150 are selected.
At process 314, a cost of MI of an action module is measured. For example, action module 222 may include discriminator 224 that measures the cost of mutual information 226.
At process 316, a reward function is computed. The RIRL framework 130 computes a reward function using the cost of mutual information 214A-M, the cost of mutual information 226, and actions 150.
Going back to
zi,t=hi,t(vik+ei,t),  (8)
where agent i works hi,t hours, exerts effort ei,t, and has ability vik. The principal 402 may move first and set a wage ωi,t. Each agent i may move second. Agent i may know the wage ωi,t before choosing work hours hi,t and effort ei,t, in return earning income ωi,t×hi,t. The utility Up of principal 402 measures profit. The utility Ua may be defined using standard utility functions, where the optimal hours h increase with the wage ω. As a consequence of this configuration, the profit-maximizing wage ωi for agent i increases with its ability vik. The agent's utility may be determined as follows:
where ρ, ci and α are constants governing the shape of Ua.
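For concreteness, a minimal sketch of the output and payment structure of one timestep is shown below. The profit measure (total output minus wages paid) and the variable names are assumptions of the sketch, and the utility functions Up and Ua themselves are not reproduced here.

```python
import numpy as np


def pa_step(ability, effort, hours, wage):
    """One timestep of the principal-agent production setting (equation (8)).

    ability, effort, hours, wage: arrays of shape [n_agents].
    """
    output = hours * (ability + effort)   # z_{i,t} = h_{i,t}(v_i^k + e_{i,t})
    income = wage * hours                 # agent i earns ω_{i,t}·h_{i,t}
    profit = output.sum() - income.sum()  # assumed profit measure for U_p
    return output, income, profit
```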
In some embodiments, a strategic principal 402 may infer private features of agents 404, such as an agent's ability. Further, an agent's equilibrium behavior may depend on any inference costs the agent experiences, e.g., attention costs. In some embodiments, to isolate and explore the effects of distinct principal attention costs, attention costs are not imposed on the agents 404. In this case, the agents' reward is the agents' utility ri,t=Ua(ωi,t, hi,t, ei,t).
For principal 402, RIRL framework 130 may be trained with three information channels (M=3). The first channel may be an “easy,” low-cost channel for observing agents 404, and the second and third channels may be “hard,” high-cost channels. The low-cost channel otƒ may include information that may be freely available (λƒ=0). This information may include the time t, the hours h agents 404 worked (e.g., timesheets may be available, which makes the hours hi agent i worked easy to determine), and the total output, Z=Σi∈[n
As such, the bounded rationality of principal 402 may be modeled through the cost to obtain information about the effort and individual outputs of agents 404. The RIRL framework 130 for the principal-actor architecture may also use attention costs Ĩ(ytθ; otƒ) and Ĩ(ωt; yt), which are omitted here for simplicity.
The results of the RIRL framework 130 analyzing the principal-agent problem are shown in
This is because the principal 402 has a different optimal wage for each ability vk as shown in
Further, higher λz increases the cost of distinguishing between the individual outputs of each agent i. Consequently, while agent utility changes with λz, it does not increase for all agent types. Specifically, the utility of the lowest-ability agents increases while the utility of the highest-ability agents decreases. Hence, the principal's uncertainty over individual outputs decreases the wage (and utility) differences between agents of different ability as illustrated in
Some examples of computing devices, such as computing device 100 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of method 300. Some common forms of machine readable media that may include the processes of method 300 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
This application is a nonprovisional of and claims priority under 35 U.S.C. 119 to U.S. Provisional Application No. 63/252,546, filed Oct. 5, 2021, which is hereby expressly incorporated by reference herein in its entirety.