The present invention relates generally to monitoring and controlling heating, ventilation, and air conditioning (HVAC) systems, and more particularly to implementing a strategy for efficiently utilizing a heat-pump based HVAC system with an auxiliary heating system.
According to the United States Department of Energy, 40% of the energy consumed in the United States is consumed by residential (22%) and commercial (18%) buildings. Furthermore, heating, ventilation and air conditioning (HVAC) systems are responsible for more than 50% of the energy consumed by buildings. As part of the effort to move to sustainable energy consumption, heat-pump based HVAC systems have gained popularity due to their high efficiency and due to the fact that they are powered by electricity as opposed to gas or oil.
One drawback of heat-pump based HVAC systems is that their efficiency sharply decreases when the outdoor temperature is around or below freezing. As a result, heat-pump based HVAC systems are backed up by an auxiliary heating system that is effective in cold weather, but that consumes about twice as much energy.
A popular way of saving energy in HVAC systems is “setting back the thermostat,” which refers to relaxing the heating/cooling requirements when the occupants are not occupying the home/office/building. Such a practice, though, may increase the energy consumption of a heat-pump based HVAC system, since recovering the temperature frequently results in excessive use of an energy-expensive, electric-resistance auxiliary heater.
As a result, there is not currently a means for minimizing energy consumption by efficiently utilizing a heat-pump based HVAC system while satisfying the comfort requirements of the occupants.
In one embodiment of the present invention, a method for efficiently utilizing an HVAC system comprises selecting each of a plurality of possible actions over a first period of time. The method further comprises recording effects of selecting actions in terms of a data set of tuples during the first period of time. The method additionally comprises selecting a model to fit a regression using regression features during a second period of time, where the regression features comprise a current indoor temperature, a current outdoor temperature and a plurality of historic indoor temperatures. Furthermore, the method comprises fitting the regression to model a transition function for each of the plurality of possible actions using the data set of tuples during the second period of time. Additionally, the method comprises determining, by a processor, an action to take using a lookahead planning approach of the selected model during the second period of time for every time-step within each sub-period of the second period of time until an end of the sub-period of the second period of time, where the time-step corresponds to a fixed segment of time within the second period of time, and where the action corresponds to implementing one of the plurality of possible actions. In addition, the method comprises recording effects of selecting actions in terms of the data set of tuples during the second period of time.
Other forms of the embodiment of the method described above are in a system and in a computer program product.
The foregoing has outlined rather generally the features and technical advantages of one or more embodiments of the present invention in order that the detailed description of the present invention that follows may be better understood. Additional features and advantages of the present invention will be described hereinafter which may form the subject of the claims of the present invention.
A better understanding of the present invention can be obtained when the following detailed description is considered in conjunction with the following drawings, in which:
While the following discusses the present invention in connection with a strategy for efficiently utilizing a heating, ventilating, and air conditioning (HVAC) system with an auxiliary heating system, the principles of the present invention may be applied to an HVAC system without an auxiliary heating system. A person of ordinary skill in the art would be capable of applying the principles of the present invention to such implementations. Further, embodiments applying the principles of the present invention to such implementations would fall within the scope of the present invention.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, it will be apparent to those skilled in the art that the present invention may be practiced without such specific details. In other instances, well-known circuits have been shown in block diagram form in order not to obscure the present invention in unnecessary detail. For the most part, details concerning timing considerations and the like have been omitted inasmuch as such details are not necessary to obtain a complete understanding of the present invention and are within the skills of persons of ordinary skill in the relevant art.
Referring now to the Figures in detail,
Referring now to
Referring again to
In one embodiment, control unit 104 may further include a communications adapter 309 coupled to bus 302. Communications adapter 309 interconnects bus 302 with an outside network thereby enabling control unit 104 to obtain weather forecasts as well as communicate with thermostat 102 (
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
As stated in the Background section, as part of the effort to move to sustainable energy consumption, heat-pump based HVAC systems have gained popularity due to their high efficiency and due to the fact that they are powered by electricity as opposed to gas or oil. One drawback of heat-pump based HVAC systems is that their efficiency sharply decreases when the outdoor temperature is around or below freezing. As a result, heat-pump based HVAC systems are backed up by an auxiliary heating system that is effective in cold weather, but that consumes about twice as much energy. A popular way of saving energy in HVAC systems is “setting back the thermostat,” which refers to relaxing the heating/cooling requirements when the occupants are not occupying the home/office/building. Such a practice, though, may increase the energy consumption of a heat-pump based HVAC system, since recovering the temperature frequently results in excessive use of an energy-expensive, electric-resistance auxiliary heater. As a result, there is not currently a means for minimizing energy consumption by efficiently utilizing a heat-pump based HVAC system while satisfying the comfort requirements of the occupants. Similarly, for cooling in the summer, setting back saves energy, but one needs to determine in advance what times to cool in order to avoid a comfort violation and minimize energy.
The principles of the present invention provide a means for efficiently utilizing a heat-pump based HVAC system, minimizing energy consumption while satisfying the comfort requirements of the occupants, as discussed below in connection with
As will be discussed further below, the principles of the present invention involve designing a complete reinforcement learning (RL) agent (program, such as application 304, of control unit 104 that efficiently utilizes heat-pump based HVAC system 101) that learns and applies a new adaptive control strategy for a heat-pump thermostat that (1) leads to roughly 7.0%-14.5% yearly energy savings in a realistic simulation of different house sizes and weather conditions, while (2) keeping the occupants' comfort level unchanged when compared to an existing strategy that is deployed in practice. Experiments are run using a complex, realistic simulator written for the United States Department of Energy. The strategy discussed herein simultaneously solves two related, but slightly different problems of heating in the winter and cooling in the summer. The agent of the present invention is realistically deployed in a simulated house that is unknown in advance, and after three days of exploration (during which occupants could be traveling away from home) starts to save energy. The agent of the present invention makes decisions in real-time, and keeps learning and improving performance while acting, as it gathers more data.
In order to apply reinforcement learning (RL) to thermostat control, the problem was defined as a continuous-state Markov Decision Process (MDP). After randomly exploring the effects of its actions during the first three days, the agent uses a regression learning algorithm to fit a transition function that models the house in which the agent operates. Using information such as the weather forecast and a history of past measurements results in a high-dimensional MDP state, and therefore it is impractical to plan, or compute a value function, over the whole state space. Therefore, the agent uses an efficient online lookahead policy, based on a constrained, specialized tree-search.
Prior to discussing the process for efficiently utilizing a heat-pump based HVAC system 101 (
Since real-world experiments would both be costly and take too much time, a complex, realistic HVAC simulation was relied upon to test the thermostat strategy of the present invention. Specifically, GridLAB-D, an open-source smart-grid simulator that was developed for the United States Department of Energy, was used. Importantly for the purposes of the present invention, it has a residential building model that includes heat gains and losses and the effects of thermal mass, as a function of weather (temperature and solar radiation), occupant behavior (thermostat settings and internal heat gains from appliances), and heating/cooling system efficiencies. It models parallel heat flow paths through the envelope of the building (walls, windows, doors, ceilings, floors, and infiltration air flows) and considers the mass of the air in the interior volume of the house. It uses meteorological data collected in hundreds of cities across the U.S. by the National Renewable Energy Laboratory, recorded in the standard TMY2 (Typical Meteorological Year) file format.
The simulation of the present invention uses a heat-pump based HVAC system. At its peak performance, a heat-pump can output heat energy that is 4 times the energy it consumes. However, when outdoor temperatures are near or below freezing, its efficiency sharply decreases; therefore, it is backed up by an auxiliary heater, which is represented by a resistive heat coil in the simulation. On one hand, the auxiliary heater's efficiency is almost unaffected by the outdoor temperature, but on the other hand, it consumes about twice the energy consumed by the heat-pump heater. A resistive heat coil is a popular backup system, partly due to the expected entrance of renewable electricity sources to the market. A heat-pump is also used for cooling, and is not backed up by an auxiliary cooler.
From an artificial intelligence perspective, the focus of the present invention is on the decision making module of the house's HVAC system, namely the thermostat, such as thermostat 102 (
In the setup of the present invention, a single-family residential home is simulated, and the occupants are assumed to be at home between 6 pm and 7 am of the next day, with the house empty between 7 am and 6 pm (referred to herein as the “don't-care period”). Furthermore, the don't-care period may specify a range of inside temperatures, such as not being greater than 100° F. or less than 40° F. In the embodiment directed to an HVAC system being implemented in a commercial building, the don't-care period is throughout the night. The goal is to minimize the total energy consumed by HVAC system 101, while (1) keeping a desired temperature range of 69-75° F. whenever the occupants are at home, and (2) being indifferent to the home temperature during the don't-care period.
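By way of illustration only, the following minimal Python sketch encodes the occupancy schedule and temperature requirements described above; the numeric constants mirror the text, while the function names and the use of clock times are hypothetical.

```python
from datetime import time

# Comfort band while occupants are home (6 pm - 7 am), per the setup above.
COMFORT_LOW_F, COMFORT_HIGH_F = 69.0, 75.0
# Loose safety bounds that may apply during the don't-care period (7 am - 6 pm).
DONT_CARE_LOW_F, DONT_CARE_HIGH_F = 40.0, 100.0

def is_dont_care(t: time) -> bool:
    """True if t falls within the 7 am - 6 pm don't-care period."""
    return time(7, 0) <= t < time(18, 0)

def comfort_violated(indoor_temp_f: float, t: time) -> bool:
    """Check the temperature requirement that applies at time t."""
    if is_dont_care(t):
        return not (DONT_CARE_LOW_F <= indoor_temp_f <= DONT_CARE_HIGH_F)
    return not (COMFORT_LOW_F <= indoor_temp_f <= COMFORT_HIGH_F)
```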
Under this setup, a straightforward setback strategy would be to turn the system off during the don't-care period, and turn it back on once occupants are at home. However, a thermostat that uses such a setback strategy can in fact increase consumption by more than 7% compared to just leaving it always on. The main reason for this is that at the end of the don't-care period, the temperature is often significantly out of range, in which case the strategy forces extended use of the energy-expensive auxiliary heating unit. More regular use of the heat pump throughout the don't-care period, as happens when leaving the thermostat always on, ends up consuming less energy. An additional problem with such a setback is that requirement (1) is frequently violated, since it might take up to several hours for the HVAC system to bring the temperature back to the desired range.
However, setting back the temperature is still desirable for saving energy, as long as it does not cause unnecessary use of auxiliary heating. This is due to the fact that setting back gets the indoor temperature closer to the outdoor temperature, which in turn slows the heat dissipation, so that less energy is needed to compensate for heat energy losses. Therefore, an ideal strategy would be able to predict whether it is possible to set back the thermostat for some time, then start heating enough in advance using the heat-pump whenever possible and auxiliary heating when unavoidable, so as to reach the desired temperature by the time the occupants are back, thus allowing the temperature setback to effectively save energy, while leaving the occupants' comfort unchanged. As discussed further herein, such a strategy is defined and tested.
The challenge is to design a control strategy that would be able to approximate this path for each house the agent of the present invention is deployed in, and for any weather conditions. Note that to this point, the focus has been on winter, which is more complicated than summer due to the two different heating actions. In fact, the strategy of the present invention works in the summer as well, where there is only one cooling action (so no need to avoid a more expensive action), but where there is still the challenge of setting back the thermostat to save energy, and start cooling in advance to bring the temperature back to range on time. In the experiments of the present invention, tests have been run throughout the year, thus testing both conditions simultaneously.
It has been assumed that the default thermostat strategy is used to keep the temperature in range whenever occupants are at home, in order to keep a similar comfort level across all tested strategies (so that only energy usage differs), and due to the lower potential for energy savings at these times. Therefore, only changing the thermostat strategy during the don't-care period is considered.
A discussion regarding utilizing such a process for controlling thermostat 102 (
Referring to
In step 502, control unit 104 records the effects of selecting actions in terms of a data set of tuples.
Upon recording the effects of selecting actions in terms of a data set of tuples, control unit 104 continues to select each of the possible actions (e.g., cooling, off, heat-pump heating and auxiliary heating) over the exploratory period of time in step 501.
As will be discussed further below, during the initial period of time that the learning agent is deployed, such as three days, the effects of selecting each of the possible actions are recorded in terms of tuples of data. After this initial period of time, control unit 104 executes an energy saving set-back policy as discussed below in connection with
Thermostat 102 works in the real-time cycle of sensing the world state, for instance the temperature and the time of day; running some computations; and acting by choosing one of four actions: cooling, off, heat-pump heating, or auxiliary heating. A strategy's goal is to minimize a cost function, which is the total energy it uses over some period, while satisfying a desired comfort level. Formally, this problem can be represented as a Markov Decision Process (MDP). An (episodic) MDP is a tuple (S, A, P, R, T), where S is the set of states; A is a set of actions; P: S×A×S→[0, 1] is a state transition probability function, where P(s, a, s′) denotes the probability of transitioning to state s′ when taking action a from state s; R: S→ℝ is a state-based reward function; and T⊆S is a set of terminal states, entering one of which terminates an episode. The MDP of the present invention is defined as follows:
A discussion regarding the choice of state representation is provided below. In the MDP of the present invention, an action is taken every 6 minutes, as the simulator models a realistic lockout of the system, such that every control action is applied for at least 6 minutes. In the context of MDPs, the goal of RL is to learn an optimal policy, when the model (namely P and/or R) is initially unknown. A policy is a mapping π: S→A from states to actions, and an optimal policy is defined as one that maximizes the long-term rewards, or equivalently minimizes the long-term costs, from every state.
When the agent is deployed in a new house, in order to perform robustly it needs to learn the characteristics of the specific house and heating system it controls, and adapt its control strategy to these characteristics. It does so by exploring and learning the effects of its actions in the house's environment for three simulated days. During this period, the agent selects each of the four possible actions and records their effects. While in practice it might be possible to use a more advanced exploration policy, for the purpose of the present invention, it is assumed that a one-time 3-day random exploration is still a realistic setup, for instance during a weekend when occupants are traveling. However, the present invention is not to be limited in scope to using this exploration method or period as other methods or periods could potentially be used. Action effects are recorded in the form of {s, a, s′} tuples, where s is a state, a is an action taken from s, and s′ is the next state transitioned into after taking action a in s. Note that since ea is part of the state in the definition of the MDP, the reward can be computed exactly by the agent at every given state. One advantage of fully random exploration is a quick coverage of larger portions of the state space and of different action sequences, which facilitates faster learning. A disadvantage is increased energy consumption, but this is outweighed by the energy savings accrued from the fourth day onward throughout the year.
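A minimal sketch of this three-day random exploration loop is given below; the environment interface (observe_state, apply_action) and the helper names are assumptions for illustration rather than the actual control-unit API, while the 6-minute step length and the four actions follow the text.

```python
import random

ACTIONS = ["cooling", "off", "heat_pump", "aux_heat"]
STEPS_PER_DAY = 24 * 10          # one action every 6 minutes
EXPLORATION_DAYS = 3

def explore(env, transitions):
    """Randomly select actions for three days, recording (s, a, s') tuples."""
    s = env.observe_state()
    for _ in range(EXPLORATION_DAYS * STEPS_PER_DAY):
        a = random.choice(ACTIONS)   # fully random exploration
        env.apply_action(a)          # each action is held for the 6-minute lockout
        s_next = env.observe_state()
        transitions.append((s, a, s_next))
        s = s_next
    return transitions
```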
Starting at the end of the third day, the agent plans and executes an energy saving set-back policy as discussed below in connection with
Referring to
In step 602, control unit 104 fits a regression to model a transition function for each of the possible actions using the data set of tuples.
Prior to discussing steps 601 and 602 in detail, a brief description of the agent executing an energy saving set-back policy is deemed appropriate. While doing so, the agent keeps recording the effects of its actions, fitting a regression model to the accumulated action-effect tuples once every user-configured period (e.g., every hour, or at midnight). Based on the most recently learned model, the agent keeps executing an efficient lookahead policy to choose the next action. The main routine for action selection, called at every time step with the current state observation, is summarized in Algorithm 1 shown below. As discussed further below, there are two main subroutines of this algorithm, namely the agent's model-learning algorithm (LearnHouseModel) (step 602), and the agent's planning and action selection algorithm (TreeSearch) (step 603).
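Since Algorithm 1 itself appears only as a figure, the sketch below merely paraphrases the loop described in this paragraph (periodic model refitting, per-step lookahead action selection during the don't-care period, and recording of action effects); all helper names are illustrative assumptions, not the routine as actually claimed.

```python
def on_time_step(state, now, transitions, models, env):
    """Called once per 6-minute time-step with the current state observation."""
    # Refit the regression models once per user-configured period (e.g., at midnight).
    if now.hour == 0 and now.minute == 0:
        models = learn_house_model(transitions)        # step 602 (sketched below)
    if is_dont_care(now.time()):
        action = tree_search(state, models, now)       # step 603: lookahead planning
    else:
        action = default_thermostat_action(state)      # keep temperature in range
    next_state = env.step(action)
    transitions.append((state, action, next_state))    # step 604: record effects
    return next_state, models
```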
The agent learns the house characteristics in a routine named LearnHouseModel (step 602). LearnHouseModel fits regression models to the collected data-set of {s, a, s′} tuples, which are samples from the house's state transition function. The agent uses these tuples as labeled examples <s, a>→s′ for fitting a regression to model the transition function, separately for each of the four actions (a total of four regression runs). One part in learning the transition function is selecting what features to include as the regression's independent variables. In turn, this determines the features included in the state representation. The following definition is used as the main guideline:
Definition 1.
A state variable is the minimally dimensioned function of history that is necessary and sufficient to compute the decision function, the transition function, and the contribution (here the reward) function.
In what follows, the process by which the regression features, and therefore the state, are selected is described (step 601). Three features in the state that are used for computing the reward function are as follows: Tin, ea, and Time. For computing the transition function, features that help predict Tin and ea are needed (Time can be directly computed). For ease of understanding and brevity, only the process of selecting features for predicting Tin is described, but the process for selecting features for predicting ea is conceptually similar and uses a subset of the state-variables needed for predicting Tin. For predicting Tin at the next time-step, an obvious feature to include, besides Tin itself, is the outdoor temperature at the current time step, Tout, as it directly affects the heat-pump operation, and is easily measurable, similarly to Tin. A linear regression using only Tin and Tout for predicting Tin is tested. Note that during regression runs a constant 1 is added as a “bias” (regression-only) feature, to enable affine regression. To test the prediction's accuracy, data was generated by simulating one year of actions and recording the resulting 87,600 {s, a, s′} tuples, one for each 6-minute time-step during one year. The prediction error was then calculated in a cross-validation test which repeatedly chooses 70% of the data as a training set and the remaining 30% of the data as a validation set and averages the results of multiple runs. The cross-validation's error measure is the mean-squared prediction error over the validation set, but the related and more intuitive error measure of the standard deviation of the prediction errors, measured in ° F. (Fahrenheit), is reported herein.
Using only Tin and Tout, the prediction error is unacceptably high: a standard deviation of more than 1° F. for a 6-minute time-step. This means that over 1 hour, the standard deviation of the prediction error can reach 10° F., which makes it hard to plan actions several hours in advance. A main source of prediction error is the hidden state of the house and the environment, for example, the temperatures of the house's walls and furniture, which serve as heat capacitors and cause actions to have a delayed effect. While a realistic thermostat generally cannot measure this hidden part of the state, it could use observable quantities that affect or correlate with the hidden part of the state. Specifically, the previous action taken by the thermostat and a history of 10 measured indoor temperatures are added as features. Adding the previous action as a feature results in 4×4=16 combinations for the recent pair of actions, and for each such combination a separate regression is run, for a total of 16, rather than 4, regressions. It is noted that if there are n actions, then n² can be quite large. As a result, the “prev-action” feature may be omitted so that there are just n regressions as opposed to n² regressions. The 10 historic temperatures are added directly as regression features.
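A minimal sketch of the model-learning step is given below, assuming the features discussed above (current Tin, current Tout, the 10 historic indoor temperatures, plus a constant bias term) and the simpler variant in which the prev-action feature is omitted, so only one regression per action is fit; the state layout and helper names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def featurize(s):
    """Assumed state layout: a dict with 'Tin', 'Tout', and a 10-value 'Tin_history'.

    A constant 1 is appended as the "bias" (regression-only) feature to enable
    affine regression, as described in the text.
    """
    return np.array([s["Tin"], s["Tout"], *s["Tin_history"], 1.0])

def learn_house_model(transitions):
    """Fit one affine regression per action, predicting the next indoor temperature."""
    by_action = {}
    for s, a, s_next in transitions:
        X, y = by_action.setdefault(a, ([], []))
        X.append(featurize(s))
        y.append(s_next["Tin"])
    models = {}
    for a, (X, y) in by_action.items():
        reg = LinearRegression(fit_intercept=False)  # bias term is already a feature
        reg.fit(np.array(X), np.array(y))
        models[a] = reg
    return models
```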
Adding features to the state representation helps in predicting Tin as a part of the transition function. But, being part of the state, these state features now themselves need to be predicted as a part of the transition function. All but one of the added features are just forward recordings of past measurements, and can be directly computed from <s, a> without the need to predict them. The only one that needs to be predicted is Tout. However, Tout is different from Tin in that it is independent of the agent's actions and can be considered as an information state, a term that refers to the part of the state describing random processes external to the agent. The approach taken herein for predicting Tout is using a weather forecast that is assumed to be given by an external source. For instance, in a realistic deployment the agent can connect to a weather forecast agency over the Internet. As the weather forecast is needed for predicting Tout, which is already part of the state, based on Definition 1 the weather forecast is added to the state representation as a (multidimensional) state feature. As the weather forecast is given by an external source, it does not need to be predicted by itself from <s, a>, so no further features are needed and the resulting state representation is the one defined above. For the purpose of simulation, a noisy weather forecast is generated from the actual future weather data, given in the TMY2 file, using the following rule. At a specific hour h, the forecast for i hours into the future, denoted as fh+i, is defined (recursively) as:
where N(0, 0.5) is a normal random variable with μ=0 and σ=0.5. Note that the noisy forecast is computed at every time step until the end of the day, and therefore changes as time progresses. A histogram of the resulting forecast errors over one year, summarized together for forecast ranges of 6-17 hours into the future (these are the forecast ranges needed during the don't-care period), is shown in
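Because the recursive definition itself is given as an equation that is not reproduced here, the sketch below shows only one plausible reading of it, in which the per-hour N(0, 0.5) noise accumulates as a random walk on top of the true TMY2 temperatures so that forecasts further into the future are noisier; the exact recursion should be treated as an assumption.

```python
import random

def noisy_forecast(actual_temps, h, horizon, sigma=0.5):
    """One plausible reading of the recursive noisy forecast (an assumption).

    actual_temps[h + i] is the true TMY2 temperature i hours after hour h.
    The forecast error accumulates as a Gaussian random walk, so forecasts
    further into the future are noisier, consistent with the text.
    """
    forecast = {h: actual_temps[h]}          # current hour is known exactly
    error = 0.0
    for i in range(1, horizon + 1):
        error += random.gauss(0.0, sigma)    # one N(0, 0.5) term per recursive step
        forecast[h + i] = actual_temps[h + i] + error
    return forecast
```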
Recall that the agent uses the default thermostat strategy to keep the temperature in range outside the don't-care period, whenever occupants are at home. During the don't-care period, the agent plans and selects actions using the nightly learned model, with the goal of executing an effective set-back strategy that both saves energy and minimizes violations of the temperature comfort requirements. In MDP terms, the agent's goal is finding a policy that maximizes the long-term reward. In general, once the approximate transition (and/or reward) functions are learned, the agent can use them to approximate the optimal policy using either one of the following three methods, or a combination of them: value-function approximation, policy function approximation, or lookahead methods.
The principles of the present invention utilize a lookahead method as discussed below.
In step 603, control unit 104 determines the action to take using a lookahead planning approach (e.g., a tree-based lookahead approach) using the result of LearnHouseModel from step 602 during the don't-care period, for every time-step within the don't-care period until the end of the don't-care period.
Upon determining the action to take using the lookahead planning approach during the don't-care period for every time-step within the don't-care period until the end of the don't-care period, control unit 104 continues to record the effects of its actions in the form of tuples of data in step 604.
Upon recording the effects of selecting actions in terms of a data set of tuples, control unit 104 continues to select a model to fit a regression using regression features in step 601. Alternatively, control unit 104 may select a model only a single time, in which case, upon recording the effects of selecting actions in terms of a data set of tuples, control unit 104 continues to fit a regression to model a transition function for each of the possible actions using the data set of tuples in step 602.
Due to the dimensionality of the state representation, it might be computationally intensive, or even impractical, to plan or approximate a value function over the whole state space. Assuming the agent has limited on-site computational resources, it needs an efficient way to plan its actions. Therefore, the agent uses an efficient tree-search lookahead that is limited to a specific class of policies. A lookahead search starts at some point during the don't-care period and ends at the end of an episode, such as at midnight. As a result, the agent makes plans for time-ranges of 6-17 hours, using actions of 6-minute length. As predicted values at time t are used to estimate values at time t+1, predictions that are further into the future accumulate uncertainty and become noisier. Therefore, an approach similar to Model-Predictive Control is taken, where the agent runs a lookahead search at a given time-step, uses the results of the search to determine only the next action to take, then runs a new search at the next time-step, and so on.
Algorithm 2 (shown further below) implements this lookahead search, selecting the next action to be the first action of the most promising path. Specifically, it initializes a priority queue (step 1) and retrieves the current weather forecast (step 2). Next, it iterates over every time-step i starting at the current time until the end of the don't-care period (step 5). The simulate( ) function (steps 7, 9, 12) uses the model and the weather forecast to simulate a specific set-back policy, which applies one action from the current time-step until time-step i, and another action from time-step i until the end of the don't-care period. For instance, step 7 simulates applying off and then heat. Simulation continues from 6 pm until the end of the episode at midnight, at this point simulating the default thermostat actions. Each simulation outputs the total accumulated reward along the simulated path, and the first action taken in this path (steps 7, 9, 12), which are then inserted into the priority queue as a key-value pair, where the total reward is the key and the returned action is the value (steps 8, 10, 13). Note that the first action could be either of the two simulated actions as initially i=m. The first action of the path that resulted in the highest reward is then selected for execution (step 15). The intuition behind the algorithm is to maximize the set-back time while still returning the temperature back to range by 6 pm, through an efficient search within a policy class that does exactly that. The reason step 9 is added, in which heat is simulated and then aux, is to account for cold days in which the heat-pump is not able to bring the temperature to the desired range. It is noted that the principles of the present invention are not to be limited in scope to the sequence of actions discussed above and may simulate other sequences of actions.
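Algorithm 2 is likewise presented as a figure, so the following sketch only mirrors the steps described in this paragraph; the simulate( ) interface, the step-index and forecast helpers, and the inclusion of a cooling counterpart for summer are assumptions for illustration.

```python
import heapq

def tree_search(state, models, now):
    """Choose the next action via a constrained lookahead over set-back policies.

    For every switch point i, a candidate path applies one action until i and a
    second action from i to the end of the don't-care period. simulate(),
    get_forecast(), and the step-index helpers are assumed utilities; simulate()
    rolls the learned model forward under the forecast and returns
    (total_reward, first_action) for the simulated path.
    """
    queue = []                                        # step 1: priority queue
    forecast = get_forecast(now)                      # step 2: current weather forecast
    m, end = current_step(now), end_of_dont_care_step(now)
    for i in range(m, end + 1):                       # step 5: every candidate switch point
        for first, second in (("off", "heat_pump"),       # step 7: off, then heat
                              ("heat_pump", "aux_heat"),  # step 9: for very cold days
                              ("off", "cooling")):        # assumed summer counterpart
            reward, first_action = simulate(state, models, forecast, first, second, i, end)
            heapq.heappush(queue, (-reward, first_action))   # steps 8, 10, 13: key = total reward
    _, best = heapq.heappop(queue)                    # step 15: first action of the best path
    return best
```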
An important part of the simulate( ) function is handling the uncertainty in the long-term predictions of Tin. In general, regression models predict the expected transition for a specific <s, a> pair. However, actual values can be higher or lower, so that relying on expected transitions can result in overly optimistic behavior that applies a strong set-back, from which the heat-pump is eventually not able to recover the temperature back to range by 6 pm, thus violating the comfort requirements. To hedge against that, each prediction is augmented with a dynamic safety buffer that encourages risk-taking in safer situations and discourages risk-taking in less safe situations. Specifically, each prediction is augmented as follows. Let σ be the standard deviation of the regression model measured on the training-set. Let Tin′ be the expected temperature predicted by the regression model. Let Δtemp be the difference between the current temperature and the required temperature range at 6 pm, and Δtime be the number of minutes until 6 pm. Then simulate( ) uses an augmented prediction p defined as:
where c is a constant and where Δtemp and Δtime are normalized by dividing Δtemp by 15 (° F.) and Δtime by 11×60=660 (minutes), and trimming their quotient to a [0, 1] range. The constant c determines the maximum number of standard deviations that could possibly augment a prediction, and was determined to be 1 by running a grid search to find the best-performing parameter over a 2,500 ft2 house using NYC weather data. The importance of using this dynamic safety buffer is demonstrated in the ablation analysis discussed further below. It is noted that the principles of the present invention are not to be limited in scope to the values of the constants discussed above, which are used for illustrative purposes.
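The augmented prediction p is also given as an equation that is not reproduced here; the sketch below therefore encodes just one plausible reading consistent with the surrounding description, in which the buffer grows with the normalized temperature gap, shrinks as more time remains, is capped at c standard deviations of the regression model, and is applied in the pessimistic direction; the exact functional form is an assumption.

```python
def augmented_prediction(tin_expected, sigma, delta_temp_f, delta_time_min,
                         c=1.0, heating=True):
    """One plausible reading of the dynamic safety buffer (an assumption).

    delta_temp_f: gap between the current temperature and the required range at 6 pm.
    delta_time_min: minutes remaining until 6 pm.
    The buffer encourages risk-taking when there is ample time relative to the
    remaining temperature gap, and discourages it otherwise.
    """
    ratio = (delta_temp_f / 15.0) / max(delta_time_min / 660.0, 1e-6)  # 660 = 11 * 60
    ratio = min(max(ratio, 0.0), 1.0)      # trim to the [0, 1] range
    buffer = c * sigma * ratio             # at most c standard deviations
    # Pessimistic direction: under-predict warming when heating, and symmetrically
    # over-predict when cooling, so the agent starts recovery early enough.
    return tin_expected - buffer if heating else tin_expected + buffer
```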
In one embodiment, testing the agent's performance starts in a range of different house sizes and weather conditions, and continues with an ablation analysis, analyzing the contributions of the different agent components to the overall performance.
In one embodiment, to test the agent's performance, GridLAB-D was used to simulate different homes under different weather conditions over a 1-year period, where heat-pump HVAC system 101 is controlled by the agent (or application 304) of control unit 104. More specifically, 21 typical residential homes, of sizes ranging from 1,000 square feet (ft2) to 4,000 ft2, were simulated. These homes were simulated under different weather conditions using typical weather data that was recorded in different cities in the United States by the U.S. National Renewable Energy Laboratory, given in a TMY2 format. The comfort requirements are as described earlier, requiring an indoor temperature of 69-75° F. from 6 pm to 7 am, with a don't-care period of 7 am-6 pm. In one embodiment, a comparison is made between the strategy of the present invention and the default thermostat strategy that is used in real deployments, where the thermostat is always on (recall that setting the thermostat back during the don't-care period when using the default strategy actually increases energy consumption).
Heat-pump systems are gaining increased popularity as a part of the effort to move society to sustainable energy consumption. While setting back the temperature is an effective energy saving strategy in other HVAC systems, the common practice is to avoid setting back heat-pump systems, since when used with existing control strategies it actually increases energy consumption. As discussed herein, the principles of the present invention involve designing and implementing a complete reinforcement learning agent that learns an effective set-back strategy, which leads to roughly 7.0%-14.5% yearly energy savings in a realistic simulation of different house sizes and weather conditions. The agent of the present invention is adaptive in the sense that when it is deployed in a new house, it learns the house properties and efficiently plans and executes a set-back strategy, which both saves energy starting the fourth day and minimizes violations of the temperature comfort constraints.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
This application is related to the following commonly owned co-pending U.S. patent application: Provisional Application Ser. No. 61/988,382, “A Learning Agent for HVAC Thermostat Control,” filed May 5, 2014, and claims the benefit of its earlier filing date under 35 U.S.C. §119(e).
This invention was made with government support under Grant Nos. IIS-0917122 awarded by National Science Foundation, 61-2075UT awarded by National Science Foundation, CNS-1305287 awarded by National Science Foundation, CNS-1330072 awarded by National Science Foundation, 21C184-01 awarded by the Office of Naval Research; N000014-09-1-0658 awarded by the Office of Naval Research; FA8750-14-1-0070 awarded by U.S. Air Force Research Laboratory and DTFH61-07-H-00030 awarded by the Federal Highway Administration. The U.S. government has certain rights in the invention.