The present disclosure relates generally to tuning controllers, and more particularly to methods and apparatuses for automatically tuning regulatory controllers.
Regulatory controllers are used in a variety of different types of control systems to regulate operation of a number of different control system components in a wide variety of applications. Many regulatory controllers are poorly tuned for a given application, meaning that the control systems they regulate are often not operating efficiently. This can result in wasted energy, excessive wear of control system components, as well as numerous other problems. Manually tuning such regulatory controllers in the field can be tedious, error prone and time consuming, especially in systems controlled by numerous such regulatory controllers. What would be desirable is an automated way to tune such regulatory controllers in the field.
The present disclosure relates generally to tuning controllers, and more particularly to methods and apparatuses for automatically tuning regulatory controllers in the field. An example is a method of tuning a controller that is configured to control at least part of a process. During each of a plurality of iterations, a policy of the controller is updated and the at least part of the process is controlled using the updated policy. The updated policy is associated with a performance level of the controller in controlling the at least part of the process. For each iteration, the updated policy is determined using the associations generated during one or more previous iterations between the policies and the corresponding performance levels of the controller in controlling the at least part of the process, such that the updated policy is optimized to have a highest likelihood of producing a positive change in the performance level of the controller in controlling the at least part of the process rather than optimized to have a highest likelihood of producing a largest positive magnitude of change in the performance level of the controller in controlling the at least part of the process relative to the previous iteration.
Another example is a method of tuning a regulatory controller that is configured to regulate at least part of a process. During each of a plurality of iterations, one or more tuning parameters of the regulatory controller are updated, and the at least part of the process is regulated using the one or more updated tuning parameters. A performance of how well the regulatory controller controlled the at least part of the process is monitored. For each iteration, the one or more updated tuning parameters are determined based at least in part on the performance of how well the regulatory controller performed in controlling the at least part of the process during one or more previous iterations, such that the updated one or more tuning parameters are optimized to have a highest likelihood of producing a positive change in the performance of how well the regulatory controller controlled the at least part of the process rather than optimized to have a highest likelihood of producing a largest positive magnitude of change in the performance of how well the regulatory controller controlled the at least part of the process relative to the immediate previous iteration.
Another example is a controller for controlling at least part of a process. The controller includes a memory for storing a policy and a processor that is operatively coupled to the memory. The processor is configured to perform a plurality of iterations. During each iteration, the controller updates the policy of the controller and controls the at least part of the process using the updated policy. The controller associates the updated policy with a performance level of the controller in controlling the at least part of the process. During each iteration, the updated policy is determined using the associations generated during one or more previous iterations between the policies and the corresponding performance levels of the controller in controlling the at least part of the process, such that the updated policy is optimized to have a highest likelihood of producing a positive change in the performance level of the controller in controlling the at least part of the process rather than optimized to have a highest likelihood of producing a largest positive magnitude of change in the performance level of the controller in controlling the at least part of the process relative to the immediate previous iteration.
The preceding summary is provided to facilitate an understanding of some of the features of the present disclosure and is not intended to be a full description. A full appreciation of the disclosure can be gained by taking the entire specification, claims, drawings, and abstract as a whole.
The disclosure may be more completely understood in consideration of the following description of various illustrative embodiments of the disclosure in connection with the accompanying drawings, in which:
While the disclosure is amenable to various modifications and alternative forms, specifics thereof have been shown by way of example in the drawings and will be described in detail. It should be understood, however, that the intention is not to limit aspects of the disclosure to the particular illustrative embodiments described. On the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure.
The following description should be read with reference to the drawings wherein like reference numerals indicate like elements. The drawings, which are not necessarily to scale, are not intended to limit the scope of the disclosure. In some of the figures, elements not believed necessary to an understanding of relationships among illustrated components may have been omitted for clarity.
All numbers are herein assumed to be modified by the term “about”, unless the content clearly dictates otherwise. The recitation of numerical ranges by endpoints includes all numbers subsumed within that range (e.g., 1 to 5 includes 1, 1.5, 2, 2.75, 3, 3.80, 4, and 5).
As used in this specification and the appended claims, the singular forms “a”, “an”, and “the” include the plural referents unless the content clearly dictates otherwise. As used in this specification and the appended claims, the term “or” is generally employed in its sense including “and/or” unless the content clearly dictates otherwise.
It is noted that references in the specification to “an embodiment”, “some embodiments”, “other embodiments”, etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is contemplated that the feature, structure, or characteristic may be applied to other embodiments whether or not explicitly described unless clearly stated to the contrary.
The illustrative control system 10 includes a number of controllers 12 that are individually labeled as 12a, 12b and 12c. While a total of three controllers 12 are shown, it will be appreciated that this is merely illustrative, as the control system 10 may have any number of controllers 12, including a substantially greater number of controllers 12. In some instances, the controllers 12 may be part of a hierarchical control system that includes layers of control, with controllers at each control layer. The controllers 12 may be considered as being at a lowest or regulatory level in which each of the controllers 12 regulates operation of a corresponding piece of controlled equipment 14. The controlled equipment 14 is individually labeled as 14a, 14b and 14c. As shown, each controller 12 is operably coupled with a corresponding single piece of controlled equipment 14. In some cases, a single controller 12 may control two or more distinct pieces of controlled equipment 14. While a total of three pieces of controlled equipment 14 are shown in
The controlled equipment 14 may represent any of a variety of different controllable components. In an HVAC system, for example, each piece of the controlled equipment 14 may represent an actuatable HVAC component such as a hot water valve, an air damper, a Variable Air Volume (VAV) box or other Air Handling Units (AHUs). The control system 10 may be considered as including sensors 16, which are individually labeled as 16a, 16b and 16c. While a total of three sensors 16 are shown, it will be appreciated that this is merely illustrative, as the control system 10 may have any number of sensors 16. Each sensor 16 may be operably coupled with one or more of the controllers 12, and may provide feedback to the controller(s) 12 that permits the controller(s) 12 to more accurately regulate the corresponding piece(s) of controlled equipment 14.
If the piece of controlled equipment 14a is, for example, a hot water valve providing hot water on demand to a radiator, the sensor 16a may be a temperature sensor that reports a current room temperature to the controller 12a that is operably coupled with the piece of controlled equipment 14a. If the current room temperature is below a temperature setpoint for that room, the controller 12a may command the piece of controlled equipment 14a (in this case, a hot water valve) to open, or to open further if already open. When the current room temperature reaches or approaches the temperature setpoint for that room, the controller 12a may command the piece of controlled equipment 14a (in this case, a hot water valve) to at least partially close. This is just an example. In some cases, it may be appropriate to think of each piece of controlled equipment 14 as representing a single actuatable device that can be opened or closed, or turned up or turned down, in response to a command to do so from the corresponding controller 12, with the corresponding sensor 16 providing feedback to the controller 12 that enables the corresponding controller 12 to better regulate operation of the piece of controlled equipment 14. As can be seen, the delay between when the hot water valve is opened and when the room temperature changes may depend on the size of the room, the heat transfer efficiency of the radiator, the distance the sensor is from the radiator, as well as many other factors that are specific to the particular installation. Other factors such as how much the water valve should be opened and/or closed under different circumstances will often depend on the particular installation. These are just examples. As can be seen, in general, a controller that is generically tuned in the factory will often not be optimally tuned for a particular installation in the field.
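As a rough illustration of the feedback relationship described above, the following Python sketch maps a room temperature error to a hot water valve command. The proportional gain, function name and numbers are hypothetical and are not part of the disclosure.

```python
# Minimal sketch (not the disclosed controller) of the feedback loop described
# above: a room temperature reading and a setpoint are mapped to a hot water
# valve position. The gain value and names are illustrative only.

def proportional_valve_command(setpoint_c, measured_c, kp=0.15):
    """Map the temperature error to a valve position in the range [0.0, 1.0]."""
    error = setpoint_c - measured_c           # positive error means the room is too cold
    command = kp * error                      # open the valve more when the room is cold
    return min(1.0, max(0.0, command))        # clamp to the valve's physical travel

# Example: a room at 19.0 C with a 21.0 C setpoint asks for a partially open valve.
print(proportional_valve_command(21.0, 19.0))   # -> 0.3
```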
In some instances, each of the controllers 12 may be operably coupled with a network 18. The network 18 may represent an internal network within a building or other facility, such as a factory or a refinery, for example. While the pieces of controlled equipment 14 are shown as being coupled directly to the corresponding controller 12, and are not shown as being coupled directly to the network 18, in some cases both the controllers 12 and the pieces of controllable equipment 14 may be directly coupled to the network 18. In this case, each controller 12 may communicate with its corresponding piece of controllable equipment 14 through the network 18. In some cases, the sensors 16 may also be directly coupled to the network 18, rather than to a corresponding controller 12.
In some instances, the control system 10 may communicate with a remote device 20 via a network 22. The network 22 may be considered as being an external network, and may for example rely on the Internet as being at least part of the network 22. In some cases, the network 22 may have a cloud-based component, represented by the cloud 24. The remote device 20 may be a computer that is remote from the facility in which the control system 10 is located. The remote device 20 may be a server such as a cloud-based server. In some instances, as will be discussed, the remote device 20 may be configured to receive data from the controllers 12 and be able to help fine tune operation of the controllers 12.
In some cases, the controllers at the regulatory control level 34 may be considered as being edge controllers, as seen by an edge controller 38. The edge controller 38 controls operation of the equipment at the controlled technology level 32 for which the edge controller 38 is responsible. The edge controller 38 may communicate with a cloud-based server 40. In some cases, and as will be discussed, the cloud-based server 40 may include a reinforcement learning block 42 that may help to fine tune the edge controller 38. In some cases, the edge controller and/or controller 36 may include a reinforcement learning block 42 to help fine tune the edge controller 38 instead of or in addition to the cloud-based server 40.
The processor 56 is configured to perform a plurality of iterations. During each iteration, the processor 56 updates the policy 54 of the controller 50 and controls at least part of the process using the updated policy 54 for a period of time. In some cases, the processor 56 may be configured to determine the updated policy 54 to use during each iteration. The processor 56 is configured to associate the updated policy 54 with a performance level of the controller 50 in controlling the at least part of the process. During each iteration, the updated policy 54 is determined using the associations generated during one or more previous iterations between the previous policies 54 and the corresponding performance levels of the controller 50 in controlling the at least part of the process, such that the updated policy is optimized to have a highest likelihood of producing a positive change in the performance level of the controller 50 in controlling the at least part of the process rather than optimized to have a highest likelihood of producing a largest positive magnitude of change in the performance level of the controller 50 in controlling the at least part of the process relative to the immediate previous iteration(s). In some instances, the processor 56 may be configured to communicate one or more parameters indicative of the performance level of the controller 50 in controlling the at least part of the process to the remote device 20 and to subsequently receive the updated policy 54 from the remote device.
In some cases, automated tuning may improve the performance of the controller 50. Reinforcement Learning (RL) may be used to help automatically tune the controller. One challenge with RL is that regulatory level controllers, such as the controllers 12, the edge controller 38 and the controller 50, may lack the processing power necessary to perform RL.
Generally, RL is a form of artificial intelligence that is concerned with optimizing the behavior of an RL agent, i.e. maximizing the return for the RL agent. RL can describe many real-world decision-making problems including optimization of company business profits, online auctions, election campaigns, computer or board game strategies, air combat problems, robotics, etc., and has been successfully applied in many of these areas. In the RL problem formulation, the RL agent interacts with an uncertain environment that changes its state with time as a result of the actions of the RL agent as well as intrinsic system dynamics.
The RL agent usually operates periodically in the discrete time domain. At every discrete time instant, the agent may choose an action a based on the state of the environment x and its policy π. The agent receives a reward r(a, x) which depends on the action chosen and the current state of the environment. Subsequently, the next environment state y will be partly affected by the actions taken previously. The optimal behavior of the agent should account not only for the immediate rewards but also for the future impacts of the actions on the state of the environment; in other words, the optimal agent's behavior should involve the capability of planning. Accordingly, RL theory is often concerned with finding the optimal agent's policy when no model of the environment is available. An RL algorithm uses the previous observations, environment states, actions and rewards as its input data. It does not rely on other information, i.e. it is purely empirical. The fact that the optimization does not rely on various modeling assumptions makes RL a promising method for solving the regulatory control problem. RL calculates an approximation of the optimal policy. The policy is a function that maps the environment states to the agent's actions. It can often be represented by a table, in which the agent may look up the optimal action to choose based on the current state.
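For illustration only, the following sketch shows the discrete-time agent/environment interaction described above, with the policy held as a simple lookup table. The toy environment, state names and rewards are assumptions, not part of the disclosure.

```python
# Sketch of the agent/environment loop: at each time step the agent looks up
# its action a = pi(x) from a table, the environment transitions to a new
# state y, and a reward r(a, x) is received. All names here are illustrative.
import random

policy = {"cold": "open_valve", "warm": "hold", "hot": "close_valve"}  # a = pi(x)

def step(state, action):
    """Toy environment: returns (next_state, reward) for the chosen action."""
    next_state = random.choice(["cold", "warm", "hot"])   # uncertain dynamics
    reward = 1.0 if next_state == "warm" else -1.0        # stand-in for r(a, x)
    return next_state, reward

state = "cold"
for t in range(5):                       # a few discrete time instants
    action = policy[state]               # the agent looks up its action from the table
    state, reward = step(state, action)  # the environment transitions and emits a reward
    print(t, action, state, reward)
```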
In a contemplated regulatory control regime, the agent chooses an action, e.g. the valve position for the next few seconds. This time can be called an evaluation period. The reward received for this action may be a combination of the temperature control accuracy and the valve position change over the evaluation period. Ideal control achieves good control accuracy with minimum actuator moves. In some instances, on the regulatory control level, the energetic efficiency of the building will not be directly considered as this problem will be solved on higher levels of the control hierarchy.
RL can be implemented using value functions, e.g. an advantage function. Another popular value function is the state-action value function, called the Q-function. The results should be identical regardless of whether the advantage function or the Q-function is used. The advantage function is more convenient for the present discussion.
The advantage function is defined using the state-value or cost-to-go function Vπ(x). In this example, the state-value Vπ(x) is defined as the expected (i.e. statistically expected) agent's return when starting with the environment at state x and pursuing a given policy a=π(x). The agent's policy is a function, possibly randomized, mapping the states of environment to the agent's actions a. Then, the advantage Aπ(a, x) of an action a at a state x with respect to the baseline policy π is the difference between two costs-to-go: (1) the return expected when using the specified action a at the initial state x before switching to the baseline policy π minus (2) the return expected when following the baseline policy from x. Formally:
Aπ(a, x) = r(a, x) + E{Vπ(y) | a, x} − Vπ(x)
The advantage is the expected return difference caused by one-step variation in a given policy. The instantaneous reward received at state x is denoted as r(a, x). Per the above definitions, Vπ(x) is the return when following the baseline policy π from x, whereas r(a, x)+Vπ(y) would be the return when applying action a at state x and causing the next environment state to be y. Because the next state y is a random variable due to a non-deterministic environment, it is necessary to take the conditional expectation E{Vπ(y)|a, x} instead of simply Vπ(y).
The advantage function has the following properties which make this function useful in finding the optimal policies:
In what is known as a greedy RL approach, the RL agent attempts to maximize the magnitude of positive change relative to the previous iteration. In the greedy approach:
The advantage function may be estimated from the data using approximation techniques to fit the observed data {[xi, yi, ai, ri], i=1, 2 . . . }. These techniques can involve least squares optimization. The data are obtained by trying various actions at various states. This advantage estimation (or Q-function estimation) is the key element of many RL algorithms. The optimal policy is found when the advantage function becomes known. In reality, it can only be approximated based on the finite data set that is available. Hence, reinforcement learning is a process of converging to the optimal policy but generally not achieving it in a finite time.
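The sketch below illustrates one way such a least squares fit could look in Python; the quadratic feature map and the synthetic records are assumptions chosen only to make the example self-contained, not the estimator actually used in the disclosure.

```python
# Hedged sketch: fit an advantage estimate to observed data by ordinary least
# squares. The quadratic feature map and synthetic targets are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic records: state x, action a, and a noisy one-step advantage target.
x = rng.uniform(-1.0, 1.0, size=200)
a = rng.uniform(-0.5, 0.5, size=200)
target = 1.0 - 16.0 * a**2 - 0.5 * x**2 + rng.normal(0.0, 0.2, size=200)

# Least-squares fit of A(a, x) ~ w0 + w1*a + w2*a^2 + w3*x + w4*x^2.
features = np.column_stack([np.ones_like(a), a, a**2, x, x**2])
w, *_ = np.linalg.lstsq(features, target, rcond=None)
print(np.round(w, 2))   # coefficients approximating [1, 0, -16, 0, -0.5]
```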
When the environment state is not completely known to the RL agent, the whole relationship between actions, states and rewards may get obscured and the learning process may not converge or its rate of convergence may be compromised. This makes RL application for regulatory control often difficult and possibly unreliable.
In the regulatory control case, not all environment states are available. As a general rule, so-called transient states will be unknown. This can be illustrated using an example: the controller opens a hot water valve further. This action does not start increasing the controlled temperature immediately. At first, the heat increases the temperature of the heat exchanger, then the heat exchanger increases the temperature of the air around the heat exchanger, which is mixed with the air in the room, which will finally increase the temperature of the sensor body. Only then will the algorithm detect the change. There will be a delay. Only after a time (maybe several minutes) can the change in the controlled temperature trend be clearly noticed by the algorithm. The controlled temperature is the state that is at the end of the causal chain.
Consider what happens if the RL agent opens the valve but, instead of waiting a sufficient time to notice the temperature trend change, it tries a new action too soon: it closes the valve this time. At this moment, the heat released by the previous opening action will arrive at the sensor. The agent will now conclude that closing the hot water valve (the current action) makes the air temperature increase (which is in fact an effect of the previous action). Unfortunately, the conclusion is grossly incorrect and will have a catastrophic impact on the controller performance. The trouble is that the environment state also contains intermediate states xi not included in the state x known to the RL agent.
Accordingly, short evaluation periods may not be optimal. Rather, it may be better to sacrifice the speed of learning in favor of robustness by using a sufficiently long evaluation period, e.g. several minutes instead of one second (one second may be a typical sampling period used in the BMS regulatory control layer). The extended period will effectively eliminate most problems stemming from unknown intermediate states. Any states which settle down in less than a minute will then not cause a problem. The knowledge of the controlled states will then be sufficient.
The disadvantage of extended evaluation periods is that the process will be uncontrolled for more than a minute, i.e. the agent will set the valve position and will not be allowed to change it for the next few minutes. This will be regarded as unacceptable for many regulatory control loops, as the control will be unresponsive.
The extended evaluation period idea can still be used if the agent's action is interpreted not as choosing a valve position but as choosing a control law. Testing an action then means running the controller with fixed parameters over the evaluation period. The process may then always be controlled using a short sampling period, with only the controller parameters being updated occasionally.
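A hedged sketch of this idea follows: the "action" is a pair of PI gains held fixed for a whole evaluation period, while the loop itself still runs at a short sampling period. The first-order room model, gain values and reward weights are invented for illustration and are not taken from the disclosure.

```python
# Sketch: evaluate one candidate control law (fixed PI gains) over a whole
# evaluation period, accumulating a reward that penalizes control error and
# actuator movement. The process model is a toy stand-in.

def evaluate_tuning(kp, ki, setpoint=21.0, steps=300, dt=1.0):
    """Run a fixed PI controller for one evaluation period; return total reward."""
    temperature, integral, valve_prev, reward = 18.0, 0.0, 0.0, 0.0
    for _ in range(steps):                      # 1 s sampling, ~5 min evaluation period
        error = setpoint - temperature
        integral += error * dt
        valve = min(1.0, max(0.0, kp * error + ki * integral))
        # reward penalizes control error and actuator movement
        reward -= error**2 + 0.1 * (valve - valve_prev)**2
        valve_prev = valve
        # toy first-order room response to the valve position
        temperature += dt * (0.05 * valve * (60.0 - temperature) - 0.02 * (temperature - 15.0))
    return reward

print(evaluate_tuning(kp=0.4, ki=0.01))
```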
Running a fixed controller for a sufficiently long period effectively eliminates problems with unknown intermediate states provided the controller is stabilizing the process and thus attenuating the effect of the intermediate states. The situation changes when the controller causes loop instability. Then the effects do not vanish over the testing period even if the period were arbitrarily long. For the above reason, an extended evaluation period cannot be viewed as the ultimate solution to the problem.
Many problems with potential RL applications caused by the unknown intermediate states could be eliminated by two choices:
As noted, the above two choices create a new problem: the advantage function estimate will be grossly affected by feedback loop instability, which will amplify the intermediate effects instead of attenuating them. The longer the evaluation period, the more the y state will depend on xi. Moreover, the unstable control is likely to hit some nonlinearity or saturation throughout the evaluation period: e.g. the valve will be either fully open or fully closed. These effects make the data obtained from such an evaluation period contradictory, non-repeatable and often difficult to model. The situation is that:
This situation resembles the role of outliers known from various problems in mathematical statistics, e.g. regression analysis. It is known that least squares estimators provide consistent parameter estimates for many statistical models. On the other hand, it is known that least squares estimators are very inefficient when the probability distribution of errors is not normal, especially when large errors are more likely to occur. A handful of outliers may make the least squares estimates inaccurate. A solution to the outlier problem is to minimize a function other than the sum of squares. The sum of Tukey's biweight (also known as bisquare) functions is a known method. The biweight behaves like the squared error function at first, but for larger errors, the function becomes constant. In this way, the sensitivity to outliers is limited. The biweight is just one example of a wider class of robust estimators developed in robust statistics.
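For reference, a small sketch of Tukey's biweight loss is shown below; the tuning constant c = 4.685 is a commonly used choice, and this snippet only illustrates the robust-loss idea mentioned above, it is not part of the disclosed method.

```python
# Tukey's biweight (bisquare) loss: approximately quadratic for small residuals,
# constant for residuals beyond the cutoff c, which limits the influence of outliers.
import numpy as np

def tukey_biweight_loss(residual, c=4.685):
    r = np.asarray(residual, dtype=float)
    inside = np.abs(r) <= c
    loss = np.full_like(r, (c**2) / 6.0)                      # constant for outliers
    loss[inside] = (c**2 / 6.0) * (1.0 - (1.0 - (r[inside] / c) ** 2) ** 3)
    return loss

print(tukey_biweight_loss([0.1, 1.0, 10.0]))  # small errors ~ quadratic, large errors capped
```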
For the RL regulatory control problem being solved, any regulatory loop instability behaves somewhat like outliers: it produces bad data for the advantage function estimation, which causes the advantage estimate to be inaccurate.
An example solution in accordance with this disclosure uses a modified policy that is based on the advantage function sign (positive, negative), ignoring its absolute value. A proposed method updates the policy by taking the action which has the highest probability of bringing a positive advantage over the baseline policy, instead of the action which brings the largest positive magnitude of advantage. This may be implemented by maximizing the sign of the advantage instead of its value:
Or possibly a soft continuous version of the sign function σ(Aπ) to avoid problems with discontinuity:
This choice still secures the convergence to the optimal policy, although the convergence rate may be slower compared to the greedy approach in ideal conditions (without outliers). At the same time, this choice is less sensitive to outliers, i.e. effects of the unknown process states.
Because it does not use the value of the advantage function but just its sign, the illustrative non-greedy method effectively classifies the actions into two categories: the actions that make the return better versus the actions that make the return worse (at an environment state). Then any of the former actions are adopted by the next policy iteration. The optimization may prefer the actions that improve the policy with high probability. This improves the robustness of the approach even further.
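A minimal sketch of this selection rule, under assumed sample data, is shown below: the score of each candidate action is the (optionally softened) average sign of its observed advantages, so a consistently slightly-better tuning beats an erratic one even when the latter has a larger average advantage. The candidate actions, samples and the use of tanh as the soft sign σ(Aπ) are illustrative assumptions.

```python
# Non-greedy selection sketch: rank candidate actions by the average (soft)
# sign of their advantage samples rather than by the average advantage value.
import numpy as np

def pick_action_by_sign(advantage_samples, soft=True, scale=1.0):
    """advantage_samples: dict mapping action -> array of observed advantages."""
    scores = {}
    for action, samples in advantage_samples.items():
        s = np.asarray(samples, dtype=float)
        # soft sign sigma(A) avoids the discontinuity of the hard sign
        scores[action] = np.mean(np.tanh(s / scale)) if soft else np.mean(np.sign(s))
    return max(scores, key=scores.get), scores

samples = {
    "tuning_A": [0.2, 0.3, 0.1, 0.25],        # consistently slightly better
    "tuning_B": [5.0, -4.0, 6.0, -5.5],       # occasionally great, often bad
}
print(pick_action_by_sign(samples))            # favors the consistent tuning_A
```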
As an example, the RL problem may be simplified by considering finite horizon control. The agent starts with the environment at a state x and terminates at the next state y. At this state, the return is terminated and no future rewards are considered. The advantage function can be written without considering Vπ(x) explicitly as:
Aπ(a, x) = r(a, x) + E{r(a, y) | a, x} − r(π(x), x) − E{r(π(y), y) | π(x), x}
This simplifies the problem: the advantage function estimate can be consistently approximated by simply averaging N samples instead of considering the statistical expectation. First, define the empirical cost-to-go:
Va(x) = r(a, x) + r(π(y), y)
Vπ(x) = r(π(x), x) + r(π(y), y)
Then the empirical advantage sample is the difference between those two costs:
The average over N such samples is an empirical advantage datum, obtained by testing an action N times and observing the costs. Consider that the actual advantage function at the current initial state x is:
Aπ(a, x) = 1 − 16a²
From here, the optimal action is clearly zero. Suppose the empirical advantage converges to the actual advantage for N→∞ but the rate of convergence is much slower for suboptimal actions. This represents a mechanism similar to the regulatory control instability: it is much harder to determine the actual advantage or actual disadvantage for the suboptimal destabilizing controllers because these will be very sensitive to the intermediate states as well as to the process nonlinearities and other complex effects.
The purpose of this example is to visualize the difference between expected advantage maximization versus the expected advantage sign maximization. This can be seen in
It cannot be concluded that the proposed non-greedy approach converges 100 times faster in general, because this example is artificial in the sense that the outliers were emphasized. However, it is a valid conclusion that maximizing the average advantage “sign” is significantly more robust in the presence of outliers. It may be noted how the simplified example differs from the typical regulatory control example. The regulatory control problem is not a finite horizon problem, and the advantage function will be estimated by a least-squares fitting algorithm instead of by simple reward averages.
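The following simulation sketch is in the spirit of the simplified example above, not a reproduction of it: heavy-tailed (Cauchy) noise stands in for the outliers produced by destabilizing tunings, and the two selection criteria are compared on the toy advantage function Aπ(a, x) = 1 − 16a². The noise model, sample counts and seed are assumptions.

```python
# Compare the greedy criterion (maximize the average empirical advantage) with
# the proposed criterion (maximize the fraction of positive advantage samples)
# on the toy advantage function, with heavier noise for suboptimal actions.
import numpy as np

rng = np.random.default_rng(1)
actions = np.linspace(-0.5, 0.5, 11)            # candidate actions, optimum at a = 0
true_advantage = 1.0 - 16.0 * actions**2

N = 50                                          # samples per action
noise_scale = 0.5 + 10.0 * actions**2           # suboptimal actions produce noisier data
samples = true_advantage[:, None] + rng.standard_cauchy((11, N)) * noise_scale[:, None]

greedy_pick = actions[np.argmax(samples.mean(axis=1))]
sign_pick = actions[np.argmax((samples > 0).mean(axis=1))]
print("greedy picks a =", greedy_pick, "| sign-based picks a =", sign_pick)
```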
The illustrative sign-based approach improves the RL robustness and generally provides a faster convergence rate. Implementing such an approach on embedded computer hardware commonly used in a regulatory control layer may be difficult, depending on the processing power available at the regulatory control layer. In some cases, some or all of the algorithm may be performed on more powerful hardware such as on a server or in the cloud.
While RL could be implemented by sending the process data to the server every sampling period, including the current controlled variable, set-point and the manipulated variable (e.g. valve position), this can represent a significant amount of data, such as about 1 Mbyte per day per controller assuming single precision arithmetic and a 1 second sampling period. Accordingly, and in some cases, the advantage function estimator does not use the raw data, but instead uses the initial and the terminal states x, y, the action a used over that evaluation period and the reward r(a, x). If the states are approximated with the control error and the action represents PID gains, this would represent only about 33.75 kbyte per day per controller assuming single precision arithmetic and a 1 minute evaluation period, which would represent about a 30× data reduction.
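The quoted figures can be sanity-checked with the back-of-the-envelope arithmetic below, assuming three raw values per second (controlled variable, set-point, manipulated variable) and a six-value record per one-minute evaluation period (scalar initial and terminal control errors, three PID gains and a reward); these assumptions are inferred for illustration rather than stated verbatim above.

```python
# Back-of-the-envelope check of the data volumes quoted above,
# assuming single-precision (4-byte) values.
BYTES_PER_VALUE = 4

raw_per_day = 3 * BYTES_PER_VALUE * 24 * 60 * 60      # CV, set-point, MV every second
record_per_day = 6 * BYTES_PER_VALUE * 24 * 60        # x, y, 3 PID gains, reward every minute

print(raw_per_day / 1e6, "MB/day raw")                # ~1.04 MB per day per controller
print(record_per_day / 1024, "kB/day records")        # ~33.75 kB per day per controller
print(round(raw_per_day / record_per_day), "x reduction")
```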
In an example implementation, the regulatory control edge device runs multiple PI and PID controller algorithms or similar fixed structure controllers, each parameterized with a finite number of values. In the case of PID, the controller gain, integral time constant and derivative time constant may represent the controller tuning vector of the control policy. At any time, the edge device may hold a tuning vector currently representing the best-known values, which can be denoted a*. To achieve the autonomous optimization of the tuning vector, the edge device applies random perturbations to these currently best-known tuning values. The magnitude of the perturbations may be optimized, but more often a reasonably small randomized perturbation of ±10% may suffice. Such perturbations may be numbered by an index i. In terms of RL, each such perturbation represents an action of the agent. Each perturbation is applied for a sufficiently long evaluation period to minimize the effects of the intermediate states. At the evaluation period start, the initial state xi of the process is recorded. This xi involves only the observable states; the unknown states are ignored. In regulatory control, xi is often the initial control error, sometimes the control error and its derivative. During the evaluation period, the edge device integrates the instantaneous rewards to evaluate the tuning performance associated with the period: ri. At the evaluation period end, the process terminal state yi is recorded and the three items are sent to the hardware running the RL algorithm along with the actual tuning ai as a single record. Thus, the record #i may include the following items:
The reward aggregation for a typical regulatory control problem will include the summation of terms related to the control error and actuator activity. Usually the following two terms may be used:
ri(t+1) = ri(t) − (ycv(t) − ysp(t))² − ρ(umv(t) − umv(t−1))²,
where ycv(t), ysp(t) are the controlled variable and its set-point, respectively, and umv(t) is the manipulated variable (controller output) at time t. The non-negative ρ is a tuning parameter used to define the optimal speed of response.
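A minimal sketch of accumulating this reward over one evaluation period is shown below; the placeholder trajectories and the value of ρ are illustrative.

```python
# Accumulate the period reward per r(t+1) = r(t) - (ycv - ysp)^2 - rho*(du)^2,
# sample by sample over one evaluation period. Signals are placeholders.

def aggregate_reward(y_cv, y_sp, u_mv, rho=0.1):
    """Return r_i for one evaluation period."""
    r = 0.0
    for t in range(1, len(y_cv)):
        r -= (y_cv[t] - y_sp[t]) ** 2            # control error penalty
        r -= rho * (u_mv[t] - u_mv[t - 1]) ** 2  # actuator activity penalty
    return r

# Example with short placeholder trajectories (setpoint 21, slightly noisy CV).
y_cv = [20.0, 20.5, 20.8, 21.1, 21.0]
y_sp = [21.0] * 5
u_mv = [0.6, 0.55, 0.5, 0.48, 0.49]
print(aggregate_reward(y_cv, y_sp, u_mv))
```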
The hardware running the RL algorithm aggregates the records [xi, yi, ai, ri] and uses them to calculate the cost-to-go function V0(x), which represents the expected return as a function of the process state averaged over the tuning values tested so far. Such a cost-to-go represents a baseline performance of the edge device controller when using the current tuning values a* including their random perturbations. If nothing were to change, this would remain the performance of the controller. It can be described as the “historical performance.”
The V0(x) or cost-to-go function estimation is a standard problem known in RL. A reasonable approach is Least-Squares Temporal Difference (LSTD) learning. It is known that the V0(x) function is a multivariable quadratic function in the case that a) the controlled process is linear and b) the reward function is a quadratic function of the process state and the controller output. Such approximations are often reasonable for PID regulatory controllers. If that is the case, the V0(x) estimation algorithm will resemble a quadratic polynomial regression.
After having estimated V0(x), the proposed algorithm calculates the advantage values achieved by the tested tuning values ai at all initial process states xi. Each test record yields one such advantage value:
Ai0 = ri + V0(yi) − V0(xi)
A positive Ai0 indicates an evaluation period during which the edge device performed above average, and vice versa. The algorithm uses such data to classify the actions (tuning vectors) into two classes: above average (or average at worst), Ai0 ≥ 0, and below average, Ai0 < 0. This classification is in fact a model of the sign of Ai0. The tuning values which performed below average can now be rejected and eliminated from the data. In the next iteration, an improved cost-to-go V1(x) can be calculated without accounting for the rejected evaluation periods. Further improvement is achieved by classifying the perturbations into below versus above average with respect to V1(x) using the refined advantage values Ai1. This process finally converges to an Ain after n iterations, presumably approximating the advantage function of the optimal policy, i.e. Ain ≥ 0. It can be noted that while the advantage values are calculated even for the eliminated periods at every iteration, the elimination concerns only the cost-to-go calculations.
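A hedged sketch of this iterative classify-and-reject loop follows. The quadratic V(x) model mirrors the linear-process/quadratic-reward case discussed above, but the plain least-squares regression used here is only a simple stand-in for the LSTD cost-to-go estimator, and the synthetic records are likewise illustrative.

```python
# Iteratively fit a cost-to-go on the kept records, compute A_i = r_i + V(y_i) - V(x_i)
# for all records, reject the below-average evaluation periods, and repeat.
import numpy as np

def features(s):
    s = np.asarray(s, dtype=float)
    return np.column_stack([np.ones_like(s), s, s ** 2])     # [1, x, x^2]

def refine(x, y, r, iterations=3):
    """x, y: initial/terminal states; r: period rewards. Returns advantages and keep mask."""
    keep = np.ones(len(r), dtype=bool)
    advantages = np.zeros(len(r))
    for _ in range(iterations):
        # cost-to-go fitted only on the records kept so far (stand-in for LSTD)
        w, *_ = np.linalg.lstsq(features(x[keep]), r[keep], rcond=None)
        v = lambda s: features(s) @ w
        advantages = r + v(y) - v(x)          # advantage values for all records
        keep = advantages >= 0                # reject below-average evaluation periods
    return advantages, keep

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 60)                # initial control errors
q = rng.uniform(-1.0, 1.0, 60)                # latent quality of each tested tuning
y = 0.2 * x                                   # terminal states near the setpoint
r = -x**2 + q                                 # better tunings earn higher period rewards
adv, kept = refine(x, y, r)
print(f"{kept.sum()} of {len(r)} evaluation periods kept after refinement")
```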
The optimal controller tuning is finally defined as an action classified as being not below average with the highest possible probability:
This method would produce a controller tuning which depends on the process state. However, simple controllers like PID are more frequently described by tuning values which are constant, independent of the process state. This can be overcome by eliminating the state x, e.g. by averaging over it:
In this way, the tuning vector which performs optimally on average is preferred over a state-dependent optimal tuning. Sometimes, the tuning dependency on the state may be desirable. Finally, the a* calculated above, representing an improved controller tuning vector, is sent back to the edge device. There, it replaces the current values and the edge device starts applying it, including the randomized perturbations. This process may be repeated going forward. In this way, the controller tuning is permanently adapting to the changing environment.
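A small sketch of this state-averaging step, with hypothetical (Kp, Ki) candidates and pooled advantage samples, might look as follows; it simply selects as the new a* the tuning whose advantage samples, pooled over all initial states, are non-negative with the highest frequency.

```python
# Pool each candidate tuning's advantage samples across initial states and pick
# the tuning with the largest fraction of non-negative samples as the new a*.
# The candidate (Kp, Ki) pairs and sample values are hypothetical.
import numpy as np

advantage_by_tuning = {                       # samples pooled over all initial states x
    (0.8, 0.05): np.array([0.4, 0.1, -0.1, 0.3, 0.2]),
    (1.2, 0.10): np.array([1.5, -2.0, 1.8, -1.6, 0.1]),
    (0.5, 0.02): np.array([-0.2, -0.3, 0.1, -0.4, -0.1]),
}

def pick_a_star(samples_by_tuning):
    score = {k: (v >= 0).mean() for k, v in samples_by_tuning.items()}
    return max(score, key=score.get), score

a_star, scores = pick_a_star(advantage_by_tuning)
print("new a* (Kp, Ki):", a_star, scores)
```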
Reinforcement learning based on the advantage function (or another value function such as the Q-function) is a standard machine learning method. All standard RL algorithms assume that a complete state observation is available, and that the state transition depends on the current state and the action (the Markovian assumption). A partially observable Markov decision process (POMDP) is a generalization of the Markov decision process (MDP) that incorporates an incomplete state observation model. It turns out that a POMDP can be treated as a standard MDP by using the belief state in place of the unknown state. The problem is that RL formulated for the belief state is complicated even for simple problems. For this reason, specific algorithms and approximations have been developed for POMDP learning. The present disclosure can be viewed as a simple heuristic solution to this complicated problem.
The disclosed approach does not address the unknown states problem directly. Rather, it proposes to extend the evaluation period, i.e. the time an action is applied. Over an extended period, the unknown initial condition may typically become negligible. However, this only works with stable controllers; unstable controllers run for an extended time amplify the unknown initial condition. The disclosure addresses this by modifying the approach so that the likelihood that the new action is better (has a positive advantage) is maximized, as opposed to the standard maximization of the advantage value. This makes the method more robust. The unstable controllers do not yield consistent advantage results; the advantage values observed by running unstable controllers will have large variance. However, their advantage values will not be consistently positive.
The at least part of the process is controlled using the controller with the updated policy, as indicated at block 106. The updated policy is associated with a performance level of the controller in controlling the at least part of the process, as indicated at block 108.
As indicated at block 110, and for each iteration, the updated policy is determined using the associations generated during one or more previous iterations between the policies and the corresponding performance levels of the controller in controlling the at least part of the process, such that the updated policy is optimized to have a highest likelihood of producing a positive change in the performance level of the controller in controlling the at least part of the process rather than optimized to have a highest likelihood of producing a largest positive magnitude of change in the performance level of the controller in controlling the at least part of the process relative to the previous iteration.
In some cases, and for each iteration, the updated policy may be determined using reinforcement learning based on an advantage function, and wherein the updated policy is based on a sign of the advantage function and not an absolute value of the advantage function. During each of the plurality of iterations, the at least part of the process is controlled using the controller with the updated policy for at least a period of time, wherein the period of time is sufficient to allow a measurable response to control actions taken by the controller in accordance with the updated policy. In some cases, the controller is an edge controller operatively coupled to a remote server, and the updated policy is determined by the remote server and communicated down to the controller before the controller controls the at least part of the process using the updated policy.
In some instances, and for each iteration, the updated one or more tuning parameters may be determined using reinforcement learning based on an advantage function, and wherein the updated one or more tuning parameters are based on a sign of the advantage function and not an absolute value of the advantage function. Controlling the at least part of the process using the regulatory controller with the updated one or more tuning parameters may be performed for at least a period of time, wherein the period of time is sufficient to allow a measurable response to control actions taken by the regulatory controller in accordance with the updated one or more tuning parameters. The one or more tuning parameters may include one or more of a Proportional (P) gain, an Integral (I) gain and a Derivative (D) gain.
The regulatory controller may be configured to control an HVAC actuator of an HVAC system. In some cases, the regulatory controller may be an edge controller operatively coupled to a remote server, and wherein the updated one or more tuning parameters are determined by the remote server and communicated down to the regulatory controller before the regulatory controller controls the at least part of the process using the updated one or more tuning parameters.
Those skilled in the art will recognize that the present disclosure may be manifested in a variety of forms other than the specific embodiments described and contemplated herein. Accordingly, departure in form and detail may be made without departing from the scope and spirit of the present disclosure as described in the appended claims.