The present disclosure relates to a neural network feature extractor for actor-critic reinforcement learning models.
Reinforcement learning is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize a cumulative reward. Reinforcement learning usually involves time series information. The agent determines how to act in the future based on the current state and discounted rewards. The way in which a state is reached can often result in hidden variables that the observable variables do not fully characterize. This is known as a partially-observable Markov decision process (POMDP). These hidden variables may not be directly knowable.
In an embodiment, a method of optimizing a charging of a vehicle battery using reinforcement learning includes: via one or more electronic battery sensors, determining observable battery state data associated with charging of a vehicle battery, wherein vehicle battery state information includes the observable battery state data and hidden battery state information; via a sequence-processing neural network feature extractor (SPNNFE), extracting features from preceding vehicle battery state information; providing an actor-critic model including (i) an actor model configured to produce an output associated with a charge command to charge the vehicle battery, and (ii) a critic model configured to output a predicted reward; and training the actor-critic model based on (i) the vehicle battery state information, and (ii) the extracted features. The training includes: updating weights of the actor model to maximize the predicted reward output by the critic model, and updating weights of the SPNNFE and weights of the critic model to minimize a difference between (i) the predicted reward output by the critic model and (ii) health-based rewards received from charging of the vehicle battery. The method includes approximating at least some of the hidden battery state information based on the extracted features in order to optimize charging of the vehicle battery.
In an embodiment, a system for optimizing a charging of a vehicle battery using reinforcement learning includes one or more processors, and memory storing instructions that, when executed by the one or more processors, cause the one or more processors to: via one or more electronic battery sensors, determine observable battery state data associated with charging of a vehicle battery, wherein vehicle battery state information includes the observable battery state data and hidden battery state information; via a sequence-processing neural network feature extractor (SPNNFE), extract features from preceding vehicle battery state information; provide an actor-critic model including (i) an actor model configured to produce an output associated with a charge command to charge the vehicle battery, and (ii) a critic model configured to output a predicted reward; and train the actor-critic model based on (i) the vehicle battery state information, and (ii) the extracted features. The actor-critic model is trained via: updating weights of the actor model to maximize the predicted reward output by the critic model, and updating weights of the SPNNFE and weights of the critic model to minimize a difference between (i) the predicted reward output by the critic model and (ii) health-based rewards received from charging of the vehicle battery. Further, at least some of the hidden battery state information is approximated based on the extracted features in order to optimize charging of the vehicle battery.
In an embodiment, a method of approximating hidden state information of a reinforcement learning model includes: via one or more electronic sensors, determining observable state information, wherein state information includes the observable state information and hidden state information; via a recurrent neural network feature extractor (RNNFE), extracting features from preceding state information; providing an actor-critic model including (i) an actor model configured to produce an output associated with a control system, and (ii) a critic model configured to output a predicted reward; and training the actor-critic model based on the state information and the extracted features. The training includes: updating weights of the actor model to maximize the predicted reward output by the critic model, and updating weights of the RNNFE and weights of the critic model to minimize a difference between (i) the predicted reward output by the critic model and (ii) rewards associated with the output for the control system. The actor-critic model is used to approximate at least some of the hidden state information based on the extracted features.
Embodiments of the present disclosure are described herein. It is to be understood, however, that the disclosed embodiments are merely examples and other embodiments can take various and alternative forms. The figures are not necessarily to scale; some features could be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the embodiments. As those of ordinary skill in the art will understand, various features illustrated and described with reference to any one of the figures can be combined with features illustrated in one or more other figures to produce embodiments that are not explicitly illustrated or described. The combinations of features illustrated provide representative embodiments for typical applications. Various combinations and modifications of the features consistent with the teachings of this disclosure, however, could be desired for particular applications or implementations.
Reinforcement learning is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize a cumulative reward. Reinforcement learning usually involves time series information. The critic learns the expected rewards and the actor learns a policy intended to maximize those rewards. The agent determines how to act in the future based on the current state and discounted rewards. The way in which a state is reached can often result in hidden variables that the observable variables do not fully characterize. These hidden variables may not be directly knowable.
Reinforcement learning can be applied to things that can be modeled in principle, for example in Markov decision processes (MDP). One class of MDP is a partially-observable MDP (POMDP). Here, the state space is split into the observable space and the hidden space. This can create difficult situations to learn from because the same observable state can correspond to different hidden states that govern behavior. So, while models can detect that something indeed is happening during the training, the decision is based on these hidden states and thus the relevant information is undetectable. In other words, the models can understand that as inputs change, the outputs vary, but the hidden states that also impact the output are not fully understood.
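The difficulty described above can be illustrated with a toy example. The following is a minimal sketch only; the function name `step`, the dynamics, and the reward are hypothetical and are not taken from the disclosure. It shows two full states that look identical to the agent yet respond differently to the same action because the hidden part differs.

```python
# Toy POMDP: full state = (observable, hidden); the agent sees only `observable`.
# All names and dynamics here are hypothetical, chosen purely for illustration.

def step(observable, hidden, action):
    """Hypothetical dynamics: the hidden variable scales the effect of the action."""
    next_observable = observable + action * hidden
    reward = -abs(next_observable - 10)  # reward is higher the closer we get to 10
    return next_observable, reward

# Two full states with the same observable part but different hidden parts.
obs_a, reward_a = step(observable=5, hidden=1, action=2)
obs_b, reward_b = step(observable=5, hidden=3, action=2)

# The agent saw the same input (5) and took the same action (2) in both cases,
# yet the outcomes differ, which is what makes learning from observations alone hard.
print(obs_a, reward_a)
print(obs_b, reward_b)
```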
In one example, in video-based reinforcement learning, a pre-trained convolutional neural network (CNN) is often used as a feature extractor to allow for details within an image to be tracked. For example, features of a detected object (e.g., its class, speed, orientation, etc.) can be extracted via a trained CNN for reinforcement learning. However, some information (e.g., private information, temporal information, etc.) can be encoded or otherwise undetectable by the model, and thus much of the information is not fully understood.
According to various embodiments disclosed herein, an actor-critic model can use preceding time steps in an MDP, and feed these into a recurrent neural network (RNN) feature extractor to learn features which are an attempt to model some of these hidden parameters. By extracting these features, the model can get a more complete state space than would be known otherwise. This improves the machine learning overall because in addition to the observable state, the model can now be provided with extracted features which are a projection of the hidden state space.
During backpropagation to train the model, the loss from the critic is fed back through the RNN feature extractor to update the feature extractor weights; the feature extractor weights are not updated during the actor model backpropagation. The critic network adapts and alters weights based on the extracted features, and that same feature information is fed into the overall model.
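The gradient routing described above can be sketched with scalar, linear stand-ins for the three networks so that the gradients can be written by hand. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the parameter names (`w_f`, `c_s`, `c_f`, `c_a`, `a_s`, `a_f`), the linear forms, the fixed `target`, and the learning rate are all hypothetical.

```python
# Scalar stand-ins (hypothetical): feature extractor, critic, and actor.
def extract(p, h):
    return p["w_f"] * h                       # f = w_f * h (h summarizes the history)

def critic(p, s, f, a):
    return p["c_s"] * s + p["c_f"] * f + p["c_a"] * a

def actor(p, s, f):
    return p["a_s"] * s + p["a_f"] * f

def critic_step(p, s, h, a, target, lr=0.01):
    """Critic loss (q - target)^2 updates the critic AND the feature extractor."""
    f = extract(p, h)
    err = critic(p, s, f, a) - target
    new = dict(p)
    new["c_s"] -= lr * 2 * err * s
    new["c_f"] -= lr * 2 * err * f
    new["c_a"] -= lr * 2 * err * a
    new["w_f"] -= lr * 2 * err * p["c_f"] * h  # chain rule back through f
    return new

def actor_step(p, s, h, lr=0.01):
    """Actor loss -q updates only actor weights; f is treated as a constant."""
    f = extract(p, h)
    new = dict(p)
    # d(-q)/da = -c_a, da/da_s = s, da/da_f = f
    new["a_s"] -= lr * (-p["c_a"]) * s
    new["a_f"] -= lr * (-p["c_a"]) * f
    return new

params = {"w_f": 0.5, "c_s": 0.1, "c_f": 0.2, "c_a": 0.3, "a_s": 0.0, "a_f": 0.0}
s, h, a, target = 1.0, 2.0, 0.5, 1.0

after_critic = critic_step(params, s, h, a, target)
after_actor = actor_step(params, s, h)
print("w_f after critic step:", after_critic["w_f"])
print("w_f after actor step: ", after_actor["w_f"])
```

In an automatic-differentiation framework, the same effect is commonly obtained by detaching the features from the computation graph before the actor loss is computed.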
The feature extractor disclosed herein can encode some of the variables hidden from the observable state in a POMDP by feeding in many of the preceding states. Using this additional information as part of the state can result in an improved policy. This effectively renders some of the unobservable information in the POMDP observable in a way that the reward system alone cannot capture.
Reference is now made to the embodiments illustrated in the Figures, which can apply these teachings to a machine learning model or neural network.
In some embodiments, the data storage 106 may further comprise a data representation 108 of an untrained version of the neural network which may be accessed by the system 100 from the data storage 106. It will be appreciated, however, that the training data 102 and the data representation 108 of the untrained neural network may also each be accessed from a different data storage, e.g., via a different subsystem of the data storage interface 104. Each subsystem may be of a type as is described above for the data storage interface 104. In other embodiments, the data representation 108 of the untrained neural network may be internally generated by the system 100 on the basis of design parameters for the neural network, and therefore may not explicitly be stored on the data storage 106. The system 100 may further comprise a processor subsystem 110 which may be configured to, during operation of the system 100, provide an iterative function as a substitute for a stack of layers of the neural network to be trained. Here, respective layers of the stack of layers being substituted may have mutually shared weights and may receive as input an output of a previous layer, or for a first layer of the stack of layers, an initial activation, and a part of the input of the stack of layers. The processor subsystem 110 may be further configured to iteratively train the neural network using the training data 102. Here, an iteration of the training by the processor subsystem 110 may comprise a forward propagation part and a backward propagation part. The backpropagation and/or forward propagation can continue until the models achieve a predetermined performance level (e.g., 100% agreement with the outcomes associated with the training data), or convergence. 
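A stopping rule for iterative training, such as a fixed iteration budget or a sufficiently small change in loss between iterations, can be sketched as follows. This is a minimal illustration; the helper name `train_until_converged`, the tolerance, and the toy quadratic objective are assumptions and do not come from the disclosure.

```python
def train_until_converged(step_fn, max_iters=1000, tol=1e-6):
    """Call step_fn() until the loss changes by less than tol or max_iters is reached.

    Returns (iterations_run, last_loss, reason).
    """
    prev = None
    for i in range(1, max_iters + 1):
        loss = step_fn()
        if prev is not None and abs(prev - loss) < tol:
            return i, loss, "residual"      # change in loss fell below the threshold
        prev = loss
    return max_iters, prev, "max_iters"     # iteration budget exhausted

# Toy objective (hypothetical): minimize (w - 3)^2 by gradient descent.
state = {"w": 0.0}

def gd_step(lr=0.1):
    grad = 2 * (state["w"] - 3.0)
    state["w"] -= lr * grad
    return (state["w"] - 3.0) ** 2

iters, final_loss, reason = train_until_converged(gd_step)
print(iters, final_loss, reason)
```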
It should be understood that in this disclosure, “convergence” can mean a set (e.g., predetermined) number of iterations have occurred, or that the residual is sufficiently small (e.g., the change in the approximate probability over iterations is less than a threshold), or other convergence conditions. The system 100 may further comprise an output interface for outputting a data representation 112 of the trained neural network; this data may also be referred to as trained model data 112. For example, as also illustrated in
The structure of the system 100 is one example of a system that may be utilized to train the models described herein. Additional structure for operating and training the machine-learning models is shown in
The processor 202 is programmed to process signals and perform general computing and arithmetic functions. Signals processed by the processor may include digital signals, data signals, computer instructions, processor instructions, messages, a bit, a bit stream, or other means that may be received, transmitted, and/or detected. Generally, the processor may be any of a variety of processors, including multiple single and multicore processors and co-processors and other multiple single and multicore processor and co-processor architectures. The processor may include various modules to execute various functions.
The processor 202 may include one or more devices selected from high-performance computing (HPC) systems including high-performance cores, microprocessors, micro-controllers, digital signal processors, microcomputers, central processing units, field programmable gate arrays, programmable logic devices, state machines, logic circuits, analog circuits, digital circuits, graphics processing units (GPUs), tensor processing units (TPUs), vision processing units (VPUs), or any other devices that manipulate signals (analog or digital) based on computer-executable instructions residing in memory 204. In some examples, the processor 202 may be a system on a chip that integrates functionality of a central processing unit, the memory 204, a network interface, and input/output interfaces into a single integrated device.
Upon execution by processor 202, the computer-executable instructions residing in the memory 204 may cause an associated control system to implement one or more of the machine-learning algorithms and/or methodologies as disclosed herein. The memory 204 may also include machine-learning data (including data parameters) supporting the functions, features, and processes of the one or more embodiments described herein.
The program code embodying the algorithms and/or methodologies described herein is capable of being individually or collectively distributed as a program product in a variety of different forms. The program code may be distributed using a computer readable storage medium (e.g., memory 204) having computer readable program instructions thereon for causing the processor 202 to carry out aspects of one or more embodiments. Computer readable storage media, which is inherently non-transitory, may include volatile and non-volatile, and removable and non-removable tangible media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer readable storage media may further include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state memory technology, portable compact disc read-only memory (CD-ROM), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and which can be read by a computer. Computer readable program instructions may be downloaded to a computer, another type of programmable data processing apparatus, or another device from a computer readable storage medium or to an external computer or external storage device via a network.
The processes, methods, or algorithms can be embodied in whole or in part using suitable hardware components, such as Application Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), state machines, GPUs, controllers or other hardware components or devices, or a combination of hardware, software and firmware components.
The bus 206 can refer to an interconnected architecture that is operably connected to other computer components inside a computer or between computers. The bus may transfer data between the computer components. The bus may be a memory bus, a memory controller, a peripheral bus, an external bus, a crossbar switch, and/or a local bus, among others. In embodiments in which the battery is a vehicle battery, the bus may be a vehicle bus that interconnects components inside a vehicle using protocols such as Media Oriented Systems Transport (MOST), Controller Area Network (CAN), or Local Interconnect Network (LIN), among others.
The simulator 208 or the processor 202 may generate a policy network 240. In particular, the reinforcement learning disclosed herein, such as the actor-critic models, can include a deep deterministic policy gradient (DDPG), more specifically a twin delayed deep deterministic policy gradient (TD3), in order to construct a charging policy to optimize the battery charging. This can include a reward system design to minimize the charging time and degradation of the battery. The reward system is the combination of rewards given by the environment and any post-processing performed by the agent, such as the discount factor, that affect the quantitative assignment of the loss function. TD3 methods in particular allow for off-policy and offline learning, enabling the disclosed hybrid approach. The policy network 240 may be stored on the memory 204 of the system 200 for the reinforcement learning.
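For illustration, the target value used to train the critics in a generic TD3 setup can be sketched as below. This reflects the commonly described TD3 recipe (the minimum over two target critics, clipped noise on the target action, and a discount factor) rather than any specific implementation of the disclosure; the function name `td3_target`, the default hyperparameters, the action bounds, and the toy target networks are all assumptions.

```python
import random

def td3_target(reward, next_state, done, target_actor, target_critic_1, target_critic_2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, a_low=-1.0, a_high=1.0,
               rng=random):
    """Generic TD3 critic target: r + gamma * min(Q1', Q2') at a smoothed target action."""
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    next_action = max(a_low, min(a_high, target_actor(next_state) + noise))
    # Clipped double-Q: use the smaller of the two target critic estimates.
    q_next = min(target_critic_1(next_state, next_action),
                 target_critic_2(next_state, next_action))
    # No bootstrapping from terminal states.
    return reward + gamma * (0.0 if done else q_next)

# Toy target networks (hypothetical) to exercise the function deterministically.
toy_actor = lambda s: 0.5
toy_q1 = lambda s, a: s + a
toy_q2 = lambda s, a: s + 2 * a

y = td3_target(1.0, 1.0, False, toy_actor, toy_q1, toy_q2, noise_std=0.0)
y_terminal = td3_target(1.0, 1.0, True, toy_actor, toy_q1, toy_q2, noise_std=0.0)
print(y, y_terminal)
```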
The system 200 may further include a communication interface 250 which enables the policy network 240 to be transmitted to other devices, such as a server 260, which may include a reinforcement learning database 262. In this way, the policy network 240 generated by the system 200 for reinforcement learning may be stored on a database of the server 260. The communication interface 250 may be a network interface device that is configured to provide communication with external systems and devices (e.g., server 260). For example, the communication interface 250 may include a wired Ethernet interface and/or a wireless interface as defined by the Institute of Electrical and Electronics Engineers (IEEE) 802.11 family of standards. The communication interface 250 may include a cellular communication interface for communicating with a cellular network (e.g., 3G, 4G, 5G). The communication interface 250 may be further configured to provide a communication interface to an external network (e.g., world-wide web or Internet) or cloud, including server 260.
The server 260 may then propagate the policy network 240 to one or more vehicles 270. While only one vehicle 270 is shown, it should be understood that more than one vehicle 270 may be provided in the system. Each of the vehicles can be either a simulation vehicle (e.g., used in lab simulations) or a field vehicle (e.g., vehicles used by consumers in actual driving and/or charging events). In hybrid embodiments, the system includes both simulation vehicle(s) and field vehicle(s). In non-hybrid embodiments, the vehicle 270 may include a simulation vehicle without a field vehicle. The vehicle 270 may be any moving vehicle that is capable of carrying one or more human occupants, such as a car, truck, van, minivan, SUV, motorcycle, scooter, boat, personal watercraft, or aircraft. In some scenarios, the vehicle includes one or more engines. The vehicle 270 may be equipped with a vehicle communication interface 272 configured to communicate with the server 260 in similar fashion as the communication interface 250. The vehicle 270 may also include a battery 274 that is configured to at least partially propel the vehicle. Therefore, the term “vehicle” may refer to an electric vehicle (EV) that is powered entirely or partially by one or more electric motors powered by the electric battery 274. The EV may include battery electric vehicles (BEV) and plug-in hybrid electric vehicles (PHEV), wherein the battery 274 propels the vehicle 270. Additionally, the term “vehicle” may refer to an autonomous vehicle and/or self-driving vehicle powered by any form of energy. The autonomous vehicle may or may not carry one or more human occupants.
The vehicle 270 also includes a battery management system 276 configured to manage, operate, and control the battery 274. In particular, the policy network 240 output from the system 200 and sent to the vehicle 270 via the server 260 can command the battery management system 276 to control the battery 274 to charge or discharge in a particular manner. Therefore, the battery management system 276 may refer to associated processors and memory (such as those described above) configured to charge the battery 274 according to stored or modified instructions. The battery management system 276 may also include various battery state sensors configured to determine the characteristics of the battery 274, such as voltage, temperature, current, amplitude, resistance, and the like. These determined signals can, when processed, determine a state of health of the battery 274.
Q-learning is a form of reinforcement learning which uses Q-values (also called action values) to iteratively improve the behavior of the learning agent. “Q” refers to the function that the algorithm computes—the expected rewards for an action taken in a given state. Q-values can be defined for states and actions on the environment, and represent an estimation of how good it is to take the action at the state.
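As a concrete illustration of iteratively improving Q-values, the standard tabular Q-learning update can be sketched as follows. The state labels, action labels, learning rate `alpha`, and discount `gamma` are hypothetical and chosen only for the example; the disclosure itself uses neural-network critics rather than a table.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_b Q(s_next, b)."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

actions = ["slow", "fast"]          # hypothetical charge commands
Q = defaultdict(float)              # unseen (state, action) pairs start at 0.0

# A rewarded transition raises the value of the action taken in that state...
first = q_update(Q, "low", "fast", 1.0, "high", actions)
# ...and that value then propagates back to a state that leads into it.
second = q_update(Q, "start", "fast", 0.0, "low", actions)
print(first, second)
```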
The diagram of
An actor-critic algorithm relies on using two neural networks to accomplish different tasks: the actor, A, which takes as input the state, s, and outputs the action, a, such that A(s)=a; and the critic, C, which takes as input the state and action and outputs the expected Q-value, C(s, a)=Q. The critic model learns from the data to estimate the Q-value (expected reward) from the state given a particular next action, C(s, a)=Q, rewards what is good, and passes this information on to the actor. The actor model learns a policy that maximizes the expected Q-value from the critic, resulting in the highest reward. The value and scale of Q are dictated by the somewhat arbitrarily defined reward system. For a fixed state, s, the highest value of C(s, a) generally corresponds to the best action to take from the state.
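The relationship between the two networks can be illustrated with closed-form stand-ins. In this sketch, which is an assumption-laden toy rather than a trained model, the critic is a hypothetical function whose value peaks at a = 0.5·s, and the actor is a policy that has already learned that maximizer; scanning candidate actions for a fixed state recovers the same action the actor outputs.

```python
def critic(s, a):
    """Hypothetical C(s, a): the Q-value peaks when the action equals half the state."""
    return -(a - 0.5 * s) ** 2

def actor(s):
    """Hypothetical A(s): a policy that has learned the critic's maximizing action."""
    return 0.5 * s

def best_action(s, candidates):
    """For a fixed state, pick the candidate action with the highest critic value."""
    return max(candidates, key=lambda a: critic(s, a))

candidates = [i / 10 for i in range(0, 21)]   # actions 0.0, 0.1, ..., 2.0
s = 2.0
print(best_action(s, candidates), actor(s))
```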
For hybrid applications—using simulation data as well as field data from real usage of vehicle batteries—the actor-critic setup can be a bit different.
Each of the models disclosed herein can be implemented by a neural network or deep neural network (DNN), an example of which is schematically illustrated in
As explained above, for a POMDP with fully-characterized states separated into an observable state and hidden state, a policy constructed accessing the full state can generally perform better than a policy constructed with just the observable states. A feature extractor is disclosed herein which is configured to learn features which are an attempt to model some of these hidden parameters. By extracting these features, the model can get a more complete state space than would be known otherwise. Using reinforcement learning described above, the “agent” comprises an actor, one or more critics, and a feature extractor. In the context of optimizing battery charging, the “environment” would be either the battery simulator or a real in-field battery. In other contexts, the “environment” can vary. Each of these systems has various inputs, outputs and update rules (e.g., learning). In the basic Markov decision process, one has an action (a), a state (s), and a reward (r). The agent and environment components include an environment (E), a feature extractor (F), an actor (A), and a critic (C). Composite components include the following:
The inputs, outputs, and update rules can be summarized as follows. The environment has an input of the full state and action (s, a), and an output of the next state and reward (s′, r). The feature extractor has an input of {s_i} for N observable steps, an output of feature f, and is updated through the critic update rule of (q + r − q′)². The critic has an input of the observable state, features, and action (s, f, a), an output of q, and an update rule of (q + r − q′)². Note here that both q and q′ are critic predictions, and the goal is to minimize this difference. The actor has inputs of the observable state and features (s, f), an output of action a, and an update rule of −q (note that a high q is deemed good, so the model tries to minimize −q).
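The data flow among the four components can be sketched with toy callables. This is a hedged illustration only: the function bodies are hypothetical placeholders, the environment here acts on the observable state alone for simplicity, discounting is omitted, and the critic loss is written exactly in the form given in the text, (q + r − q′)².

```python
def feature_extractor(history):            # F: {s_i} for N steps -> f
    return sum(history) / len(history)

def actor(s, f):                           # A: (s, f) -> a
    return 0.1 * s + 0.2 * f

def critic(s, f, a):                       # C: (s, f, a) -> q
    return 0.5 * s + 0.3 * f - a * a

def environment(s, a):                     # E: (s, a) -> (s', r)
    return s + a, -abs(a)

def losses(history, s):
    """Wire the components together for one step and return the two training losses."""
    f = feature_extractor(history)
    a = actor(s, f)
    q = critic(s, f, a)
    s_next, r = environment(s, a)
    # Slide the history window forward by one step before predicting q'.
    f_next = feature_extractor(history[1:] + [s])
    q_next = critic(s_next, f_next, actor(s_next, f_next))
    critic_loss = (q + r - q_next) ** 2    # form taken from the text; trains C and F
    actor_loss = -q                        # minimizing -q maximizes q; trains A only
    return critic_loss, actor_loss, q

critic_loss, actor_loss, q = losses(history=[1.0, 2.0, 3.0], s=4.0)
print(critic_loss, actor_loss)
```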
Given the above, if some of the unknown hidden state can be extracted from the history of the observable state, then an RNN (such as a long short-term memory) can partially or fully produce a set of variables that encodes the information contained in the hidden state. According to an embodiment illustrated in
At 704, a sequence-processing neural network feature extractor (SPNNFE) is used to extract features from preceding vehicle battery state information. The SPNNFE can also be referred to more generally as a feature extractor, and in some embodiments is an RNN feature extractor.
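A sequence-processing feature extractor of this general kind can be sketched with a plain Elman-style recurrence, used here as a simpler stand-in for an LSTM or other sequence model. The weights, the two-dimensional feature size, and the (voltage, current, normalized temperature) readings are hypothetical; in practice the weights would be learned from the critic loss rather than fixed by hand.

```python
import math

def rnn_features(sequence, w_in, w_rec, bias):
    """Elman-style recurrence h_t = tanh(W_in x_t + W_rec h_{t-1} + b); returns the final h."""
    hidden = [0.0] * len(bias)
    for x in sequence:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hk for w, hk in zip(w_rec[j], hidden))
                + bias[j]
            )
            for j in range(len(bias))
        ]
    return hidden

# Hypothetical fixed weights: 3 inputs per time step, 2 extracted features.
w_in = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.1]]
w_rec = [[0.1, 0.0], [0.0, 0.1]]
bias = [0.0, 0.0]

# Two histories that end in the same observation but arrived there differently.
history_a = [(3.6, 1.0, 0.25), (3.8, 1.0, 0.27)]
history_b = [(3.2, 2.0, 0.30), (3.8, 1.0, 0.27)]

features_a = rnn_features(history_a, w_in, w_rec, bias)
features_b = rnn_features(history_b, w_in, w_rec, bias)
print(features_a)
print(features_b)
```

Because the recurrence carries information forward from earlier time steps, the two feature vectors differ even though the most recent observation is identical, which is the property that lets the features stand in for part of the hidden state.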
An actor-critic model can be provided, such as those described above. In particular, the actor-critic model can have an actor model configured to produce an output (e.g., policy) associated with a charge command to optimally charge the vehicle battery. The actor-critic model can also have a critic model configured to output a predicted reward. At 706, the actor-critic model is trained based on (i) the vehicle battery state information, (ii) the extracted features, and (iii) a current applied to the battery (action).
The training step at 706 can include sub-steps, illustrated at 708-712. At 708, the weights of the actor model are updated to maximize the predicted reward output by the critic model. At 710, the weights of the SPNNFE and the weights of the critic model are updated to minimize a difference between (i) the predicted reward output by the critic model and (ii) health-based rewards received from charging of the vehicle battery. At 712, at least some of the hidden battery state information is approximated based on the extracted features in order to optimize charging of the vehicle battery.
It should be understood that while
Moreover, the teachings disclosed herein can be applied to environments outside of batteries. For example, the disclosed feature extractor and the reinforcement learning can be applied to any POMDP scenario where only some of the information can be observed, and some of the information is hidden. For example, the feature extractor can be used in retail scenarios in which the neural networks are configured to predict when a person will shop next, where the person will shop next, and what item the person will purchase next so that a proper recommendation can be given to that person. There may be countless variables that go into these decisions made by consumers, many of which are simply unobservable and therefore can be modeled as a POMDP with the disclosed feature extractor.
The teachings herein can also be used in any control application approximated by a POMDP, which can arise frequently due to incomplete state information. Specifically, the feature extractor can be used for learning a policy for controlling a physical system and then operating the physical system where some of the state is not directly observable or easily computed. For example,
As another example,
It should also be understood that the scope of the invention is not limited to only actor-critic models in particular, unless otherwise stated in the claims. Instead, the teachings provided herein can apply to various forms of reinforcement models, such as on-policy reinforcement models, off-policy reinforcement models, or offline reinforcement models. In on-policy reinforcement learning models, the agent learns from the current state-action-reward information produced by the current best policy when it interacts with the environment. In off-policy reinforcement learning models, the agent learns from past experience that is stored in a replay buffer that grows as it interacts more with the environment. The state-action-reward values in the buffer do not correspond to the current best policy. In offline reinforcement models, the agent learns from past experience that is stored in a replay buffer that is static; there is no continued interaction with the environment. Offline is a special case of off-policy; an on-policy algorithm cannot be used offline.
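The distinction between off-policy and offline learning can be illustrated with a small replay buffer. This is a minimal sketch; the class name `ReplayBuffer`, the `capacity` limit, and the `freeze` method are assumptions introduced for the example. In off-policy use the buffer keeps receiving transitions from the environment, while in offline use it is frozen and only sampled.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next) transitions for off-policy or offline training."""

    def __init__(self, capacity=10000):
        self._data = deque(maxlen=capacity)   # oldest transitions drop out when full
        self._frozen = False

    def add(self, s, a, r, s_next):
        if self._frozen:
            raise RuntimeError("offline buffer is static; no new experience can be added")
        self._data.append((s, a, r, s_next))

    def freeze(self):
        """Switch to offline use: the stored experience is fixed from here on."""
        self._frozen = True

    def sample(self, batch_size, rng=random):
        """Draw a random batch; it need not come from the current policy."""
        return rng.sample(list(self._data), min(batch_size, len(self._data)))

    def __len__(self):
        return len(self._data)

buffer = ReplayBuffer()
for t in range(5):                    # off-policy phase: the buffer grows with interaction
    buffer.add(s=t, a=0.1 * t, r=-1.0, s_next=t + 1)
batch = buffer.sample(3)
buffer.freeze()                       # offline phase: sampling only
print(len(buffer), len(batch))
```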
In other embodiments, the control system may be for controlling a semi- or fully-autonomous vehicle, where hidden state information includes items of information regarding pedestrians, other vehicles, or road-specific data such as presence of potholes or faded lane lines, and the control system is configured to control (e.g., maneuver) the vehicle based on a reinforcement model that uses state information.
While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms encompassed by the claims. The words used in the specification are words of description rather than limitation, and it is understood that various changes can be made without departing from the spirit and scope of the disclosure. As previously described, the features of various embodiments can be combined to form further embodiments of the invention that may not be explicitly described or illustrated. While various embodiments could have been described as providing advantages or being preferred over other embodiments or prior art implementations with respect to one or more desired characteristics, those of ordinary skill in the art recognize that one or more features or characteristics can be compromised to achieve desired overall system attributes, which depend on the specific application and implementation. These attributes can include, but are not limited to cost, strength, durability, life cycle cost, marketability, appearance, packaging, size, serviceability, weight, manufacturability, ease of assembly, etc. As such, to the extent any embodiments are described as less desirable than other embodiments or prior art implementations with respect to one or more characteristics, these embodiments are not outside the scope of the disclosure and can be desirable for particular applications.
The present disclosure is related to the following applications which are filed on the same day as this application, and which are incorporated by reference herein in their entirety: U.S. patent application Ser. No. ______, titled REINFORCEMENT LEARNING FOR CONTINUED LEARNING OF OPTIMAL BATTERY CHARGING, attorney docket number 097182-00205; and U.S. patent application Ser. No. ______, titled SMOOTHED REWARD SYSTEM TRANSFER FOR ACTOR-CRITIC REINFORCEMENT LEARNING MODELS, attorney docket number 097182-00206.