Disclosed are embodiments related to improving cavity filter tuning using imitation and reinforcement learning.
Cavity filters are mechanical filters that are commonly used in 4G and 5G radio base stations. There is a great demand for such cavity filters, e.g. given the growing trend of the internet of things and the connected society. During the production process of cavity filters, there are always physical deviations in the cavities and cross couplings of the filter, which requires the filter to be tuned manually to make the magnitude responses of the scattering parameters fit some specifications. This manual tuning requires an expert's experience and intuition to adjust the screw positions on the filter and is therefore costly and time consuming, and also prevents the manufacturing process from being fully automated.
Reinforcement learning is a technique to solve sequential decision-making problems. It models the problem as a Markov decision process (MDP) in which an agent interacts with an environment, receiving a (state, reward) pair and acting in response so as to achieve a high cumulative long-term reward. Deep reinforcement learning, which uses deep neural networks as function approximators, has recently succeeded in learning to play Atari games at a human level and in beating human masters at the game of Go, and has even shown some promise for tuning cavity filters.
Imitation learning is a powerful and practical alternative to reinforcement learning for learning sequential decision-making policies using demonstrations. Imitation learning learns how to make sequences of decisions in an environment, where the training signal comes from demonstrations. Imitation learning has been widely used in robotics and autonomous driving.
While imitation learning is useful in many circumstances (in particular, it is far more sample efficient than Reinforcement Learning), it has the obvious drawback of being unable to outperform its “parent” (expert) policy. Thus, any imperfections of the parent are carried over to the child. Reinforcement Learning has no such limitations, but it is extremely sample inefficient. By utilizing imitation learning as an initialization for a Reinforcement Learning (RL)-technique it should, in principle, be possible to combine the best of both, or at least create a technique which can outperform the parent policy faster than any reinforcement learning technique.
Some attempts at automating cavity filter tuning have been made, though each such attempt has had deficiencies. For example, systems may only tune the cavity filter to satisfy the S11 parameters (return loss) without regard for the other scattering (S-) parameters. One system has used neural networks to determine how to turn the screws of a cavity filter, by manually tuning a filter and then learning the deviations in screw positions of all screws in the filter as a function of the S-parameters. However, the system only considered return loss requirements and only predicted deviations of the frequency screws, assuming the coupling and cross-coupling screws were already well-tuned.
Embodiments disclosed herein model filter tuning with an imitation and reinforcement learning technique, which first performs imitation learning iterations with data from one well-trained expert filter tuning model. Then the weights of the trained imitation policy are used in a policy gradient reinforcement learning method which gives output with action of all screws being tuned in each step. Finally, a screw selector is trained using reinforcement learning to allow only one screw to be tuned at a time.
Embodiments have several advantages. For example, the imitation and reinforcement learning agent performs better than a well-trained expert model because it uses the expert policy as its initial policy and then improves upon it through reinforcement learning. Thus, it can outperform a well-trained expert model with a higher tuning success rate and fewer adjustment steps, which leads to a shorter total tuning time. Additionally, the imitation and reinforcement learning based cavity filter tuning model of embodiments has been applied in a simulation environment, where it could tune cavity filters with more screws, satisfy both S11 and S21 parameters (return loss and insertion loss), and tune both coupling and cross-coupling screws, improving upon prior art solutions.
According to a first aspect, a method for solving a sequential decision-making problem is provided. The method includes gathering state-action pair data from an expert policy. The method further includes applying imitation learning to yield a cloned policy based on the gathered state-action pair data from the expert policy. The method further includes applying a reinforcement learning technique, wherein the reinforcement learning technique is initialized based on the cloned policy and has an output with one or more actions to be performed for solving the sequential decision-making problem.
In some embodiments, the imitation learning comprises a behavioral cloning technique. In some embodiments, the sequential decision-making problem for solving comprises cavity filter tuning and the method further includes applying a screw selector for tuning a screw in a cavity filter. In some embodiments, the screw selector comprises a Deep Q Network (DQN). In some embodiments, the expert policy is based on Tuning Guide Program (TGP). In some embodiments, the cloned policy is in the form of a neural network, wherein the deepest hidden layer is convolutional in one dimension.
In some embodiments, the reinforcement learning technique comprises the Deep Deterministic Policy Gradient (DDPG) technique. In some embodiments, an output of the reinforcement learning technique is forced via a multiplied tanh function. In some embodiments, applying the reinforcement learning technique comprises allowing the reinforcement learning technique to run for Ncritic iterations where only the critic network is trained, with no change to the actor network or target network, and after the Ncritic iterations, allowing the technique to run to convergence. In some embodiments, the method further includes performing the one or more actions of the output of the reinforcement learning technique.
According to a second aspect, a node for solving sequential decision-making problems is provided. The node includes a data storage system. The node further includes a data processing apparatus comprising a processor. The data processing apparatus is coupled to the data storage system, and the data processing apparatus is configured to gather state-action pair data from an expert policy. The data processing apparatus is further configured to apply imitation learning to yield a cloned policy based on the gathered state-action pair data from the expert policy. The data processing apparatus is further configured to apply a reinforcement learning technique, wherein the reinforcement learning technique is initialized based on the cloned policy.
According to a third aspect, a node for solving sequential decision-making problems is provided. The node includes a gathering unit configured to gather state-action pair data from an expert policy. The node further includes an imitation learning unit configured to apply imitation learning to yield a cloned policy based on the gathered state-action pair data from the expert policy. The node further includes a reinforcement learning unit configured to apply a reinforcement learning technique, wherein the reinforcement learning technique is initialized based on the cloned policy.
According to a fourth aspect, a computer program is provided comprising instructions which when executed by processing circuitry of a node causes the node to perform the method of any one of the embodiments of the first aspect.
According to a fifth aspect, a carrier is provided containing the computer program of the fourth aspect, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
An example of an intelligent filter tuning technique using a common reinforcement learning technique follows. Filter tuning as an MDP can be described as follows.
State: The S-parameters are the state. The S-parameters are frequency dependent, i.e. S=S(f). For a two-port filter, the S-parameters are S11, S12, S21, and S22. The S-parameters may be the output of a Vector Network Analyzer, which displays S-parameter curves. The observation input to the artificial neural networks (ANNs) of the policy function and the Q-network for a single observation may be a real-valued vector including the real and imaginary parts of all the components of the S-parameters. Every MHz in a range between 850 and 950 MHz was sampled and appended to a vector with 400 elements.
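As a minimal sketch of how such an observation vector might be assembled, the snippet below flattens complex S-parameter curves into real and imaginary parts. The function name `build_observation`, the 100-point frequency grid, the use of only S11 and S21, and the placeholder curves are all assumptions made for illustration; they are simply one layout that yields a 400-element vector.

```python
import numpy as np

def build_observation(s_params):
    """Flatten complex S-parameter curves into one real-valued vector.

    s_params: list of 1-D complex arrays, one per S-parameter, all sampled
    on the same frequency grid (hypothetical layout, not from the source).
    """
    parts = []
    for curve in s_params:
        curve = np.asarray(curve, dtype=complex)
        parts.append(curve.real)  # real part of every frequency sample
        parts.append(curve.imag)  # imaginary part of every frequency sample
    return np.concatenate(parts)

# Assumed grid: 100 points at 1 MHz spacing starting at 850 MHz.
freqs_mhz = np.arange(850, 950)
# Placeholder curves standing in for measured S11 and S21 data.
s11 = np.exp(1j * 0.01 * freqs_mhz)
s21 = 0.5 * np.exp(-1j * 0.02 * freqs_mhz)

obs = build_observation([s11, s21])
```

With two curves of 100 samples each, the real and imaginary parts give 2 × 2 × 100 = 400 elements.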
Action: Tuning the cavity filter. For example, a 6p2z type filter has 13 adjustable screws, each with a continuous range of [−90°, 90°]. One or more of the screws may be adjusted for tuning purposes.
Reward: The agent receives a positive reward (e.g. +100) if the state satisfies the design specification; otherwise, it incurs a negative reward that depends on the distance to the tuning specifications. This shaped reward function may be heuristically designed by human intuition and does not necessarily lead to an optimal policy for problem solving. An example follows:
Here s11spec(f) and s21spec(f) are the lower or upper bound of the design specifications. Then the total reward for a state s becomes:
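As a hedged illustration only, a shaped reward of the kind described (a fixed positive reward when the specification is met, otherwise a penalty that grows with the distance to the specification) could look like the sketch below. The function name, the dB units, the choice of an upper bound on S11 and a lower bound on S21, and the summed-violation distance are assumptions made for this example and are not the reward of the embodiments.

```python
import numpy as np

def shaped_reward(s11_db, s21_db, s11_upper_db, s21_lower_db,
                  success_reward=100.0):
    """Hypothetical shaped reward over sampled magnitude responses in dB."""
    # Per-frequency violation: how far each curve lies on the wrong side
    # of its (assumed) bound; zero where the bound is respected.
    s11_violation = np.maximum(s11_db - s11_upper_db, 0.0)
    s21_violation = np.maximum(s21_lower_db - s21_db, 0.0)
    distance = float(s11_violation.sum() + s21_violation.sum())
    if distance == 0.0:
        return success_reward  # specification satisfied everywhere
    return -distance           # penalty grows with distance to the spec
```

For instance, curves that respect both bounds at every sampled frequency yield the success reward, while any violation yields a negative value proportional to its total size.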
The reinforcement learning technique used may be the Deep Deterministic Policy Gradient (DDPG) technique. Simulation results using the DDPG technique show that the agent could find a good policy after sampling about 149,000 data points with the best available hyper-parameters.
Tuning Guide Program (TGP) is one prominent example of an automatic tuning technique. By calculating the return loss curve which best matches a Chebyshev polynomial within the passband, within the feasible set of the current filter model, TGP can calculate the optimal positions of the screws and thereby provide recommendations for how to tune each screw. As the true filter may not match the model, TGP updates its estimate of the feasible set in each iteration until the filter is tuned.
TGP is (as of the time of writing) state-of-the-art on the problem of automatic cavity filter tuning. On a 6p2z environment, for example, TGP is able to tune filters with an accuracy of 97% and, on average, 27 screw adjustments. The accuracy, in this case, refers to the probability that the filter will be tuned within 100 adjustments when initialized randomly. Embodiments disclosed herein build upon learning from expert data, such as that gathered by running TGP. Accordingly, embodiments herein provide solutions to the following two problems: (1) with as few data points as possible, how to ensure that the trained policy has a significantly better accuracy than the expert data (e.g. TGP); and (2) with as few data points as possible, how to ensure that the trained policy, on average, uses significantly fewer screw adjustments than the expert data (e.g. TGP), while maintaining the same or substantially similar accuracy.
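The two figures of merit used above can be computed from a batch of tuning episodes roughly as in the following sketch. The function name and the episode record format are hypothetical, and averaging the adjustment count over successful episodes only is an assumption, since the averaging convention is not stated.

```python
def summarize_episodes(adjustment_counts, max_adjustments=100):
    """Return (accuracy, mean adjustments) for a list of episodes.

    adjustment_counts: one entry per randomly initialized episode, holding
    the number of screw adjustments used, or None if the filter was never
    tuned (hypothetical record format).
    """
    tuned = [n for n in adjustment_counts
             if n is not None and n <= max_adjustments]
    # Accuracy: fraction of episodes tuned within the adjustment budget.
    accuracy = len(tuned) / len(adjustment_counts)
    # Assumption: the average is taken over successful episodes only.
    mean_adjustments = sum(tuned) / len(tuned) if tuned else float("nan")
    return accuracy, mean_adjustments
```

For example, `summarize_episodes([20, 30, None, 40])` gives an accuracy of 0.75 and a mean of 30.0 adjustments.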
In order to address the two issues identified above, embodiments herein provide an imitation-reinforcement learning technique, such as detailed below.
As a first step, state-action pair data is gathered with an expert policy (such as provided by TGP). An expert policy refers to a known policy which is desired to be improved, such as a policy where actions are chosen by a source of expert knowledge (e.g., a human expert that manually selects actions), or a policy that is known to have decent performance (e.g., TGP in the case of tuning cavity filters). After this, behavioral cloning may be performed on the expert policy, yielding a cloned policy. The expert policy and/or cloned policy may take the form of a neural network, where the deepest hidden layer is convolutional in one dimension. Convolutional layers in a neural network convolve the input (e.g., with a multiplication or other dot product) and pass the result to the next layer.
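The gathering and cloning steps above amount to supervised learning on state-action pairs, which can be sketched as follows. This is a toy illustration under stated assumptions: a fixed linear map stands in for the expert (it is not TGP), a linear policy fitted by least squares stands in for the neural network with a one-dimensional convolutional layer, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_SCREWS = 8, 3  # toy sizes, not those of a real filter

# Hypothetical stand-in for the expert policy: a fixed linear map.
W_expert = rng.normal(size=(STATE_DIM, N_SCREWS))

def expert_policy(state):
    return state @ W_expert

# Step 1: gather state-action pairs by querying the expert.
states = rng.normal(size=(200, STATE_DIM))
actions = expert_policy(states)

# Step 2: behavioral cloning, i.e. supervised regression from states to
# the expert's actions (least squares here instead of network training).
W_clone, *_ = np.linalg.lstsq(states, actions, rcond=None)

def cloned_policy(state):
    return state @ W_clone
```

In this noise-free toy setting the cloned policy reproduces the expert, which also illustrates why cloning alone cannot be expected to outperform its source of demonstrations.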
In order to improve the performance of the policy obtained with imitation learning, a reinforcement learning technique is employed. The reinforcement learning technique may employ an actor-critic architecture, i.e. an actor neural network and a critic neural network. In an actor-critic technique (such as DDPG), the actor network is used to select actions, and the critic network is used to criticize the actions made by the actor, where the criticism by the critic network iteratively improves the policy of the actor network. A target network may also be used, which is similar to the actor network and initialized to the actor network, but is updated more slowly than the actor network, in order to improve convergence speed. In embodiments, the DDPG technique may be used, where an actor network is initialized with the weights of an imitation policy, as trained in the previous steps. To maintain consistency with the imitator network, the output may be forced (e.g., via a multiplied tanh function) to be within the interval [−ba, ba]. In order to have a well-initialized critic network, the reinforcement learning technique (e.g., DDPG) may be allowed to run for Ncritic iterations in which only the critic network is trained, with no change to the actor network or target network. After this, the technique is allowed to run to convergence.
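Two details of this step, the bounded actor output and the critic-only warm-up, can be sketched as below. The function names are hypothetical, `b_a` denotes the action bound of the interval above, and a fixed iteration count `n_total` stands in for "run to convergence"; the actual network updates are omitted.

```python
import numpy as np

def bounded_action(pre_activation, b_a):
    """Squash raw actor outputs into [-b_a, b_a] with a scaled tanh."""
    return b_a * np.tanh(pre_activation)

def training_schedule(n_critic, n_total):
    """Yield (iteration, update_critic, update_actor_and_target) flags.

    During the first n_critic iterations only the critic is updated, so
    the actor keeps the weights it inherited from the imitation policy.
    """
    for it in range(n_total):
        warmup = it < n_critic
        yield it, True, not warmup
```

A training loop would consume these flags and skip the actor and target updates while the second flag is false, so that the critic first learns to evaluate the imitation policy before that policy is changed.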
In some embodiments, a screw selector (such as one using a Deep Q Network (DQN)) may be used. For example, when using DDPG, it may be necessary to turn all screws in every step in order to converge. This property is suboptimal for minimizing or reducing the number of adjustments needed. A screw selector may be trained (e.g. using DQN) to allow the technique to tune only one screw at a time. In embodiments, anywhere from one screw to all the screws may be adjusted on a given step.
For example, the screw selector may be trained in the following manner. In every step, S-parameter data is gathered and a trained reinforcement learning actor network (for instance, the one from the steps above) predicts an action to be performed for every screw. Both of these (the S-parameter data and the action for every screw) are fed into a fully connected neural network, which predicts a Q-value (short for quality value, an estimate of cumulative reward) for each screw. When trained, the agent then tunes the screw with the highest predicted Q-value by the amount predicted by the DDPG actor network for that particular screw. The Q-network is trained using the DQN technique with an ε-decay exploration scheme.
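The selection step described above can be sketched as follows, with the Q-values taken as given rather than computed by a network. The function names, the exponential decay schedule, and its default parameters are assumptions for illustration; only the idea of choosing the screw with the highest Q-value, applying the actor's proposed amount to that screw alone, and exploring with a decaying rate comes from the description.

```python
import numpy as np

def select_single_screw(q_values, proposed_actions, epsilon, rng):
    """Pick one screw and return (index, action vector with one nonzero)."""
    n_screws = len(proposed_actions)
    if rng.random() < epsilon:
        screw = int(rng.integers(n_screws))  # explore: random screw
    else:
        screw = int(np.argmax(q_values))     # exploit: highest Q-value
    action = np.zeros(n_screws)
    # Turn only the chosen screw, by the amount the actor proposed for it.
    action[screw] = proposed_actions[screw]
    return screw, action

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay=0.99):
    """Hypothetical exponential decay of the exploration rate."""
    return max(eps_end, eps_start * decay ** step)
```

With `epsilon=0.0` the selector is purely greedy, which corresponds to how the trained agent would act at tuning time.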
The table below shows the performance of different tuning techniques for a 6p2z filter. TGP refers to the expert data mentioned above. DDPG (only) refers to using only reinforcement learning using the DDPG technique. IL-DDPG (without DQN) refers to using imitation learning and reinforcement learning (using the DDPG technique). Finally, IL-DDPG-DQN refers to using imitation learning and reinforcement learning (using the DDPG technique), and additionally using a screw selector (using the DQN technique). The IL-DDPG-DQN combination has a higher success rate and fewer adjustment steps (on average), which leads to shorter total tuning time.
Step s402 comprises gathering state-action pair data from an expert policy.
Step s404 comprises applying imitation learning to yield a cloned policy based on the gathered state-action pair data from the expert policy.
Step s406 comprises applying a reinforcement learning technique, wherein the reinforcement learning technique is initialized based on the cloned policy and has an output with one or more actions to be performed for solving the sequential decision-making problem.
In embodiments, the imitation learning comprises a behavioral cloning technique. In embodiments, the method further includes applying a screw selector for tuning a screw in a cavity filter, such as a screw selector comprising a Deep Q Network (DQN). In embodiments, the expert policy is based on Tuning Guide Program (TGP). In embodiments, the cloned policy is in the form of a neural network, wherein the deepest hidden layer is convolutional in one dimension. In embodiments, the reinforcement learning technique comprises the Deep Deterministic Policy Gradient (DDPG) technique. In embodiments, an output of the reinforcement learning technique is forced via a multiplied tanh function. In embodiments, applying the reinforcement learning technique comprises allowing the reinforcement learning technique to run for Ncritic iterations where only the critic network is trained, with no change to the actor network or target network, and after the Ncritic iterations, allowing the technique to run to convergence. In embodiments, the method further includes performing the one or more actions of the output of the reinforcement learning technique.
While various embodiments of the present disclosure are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.
This application is a 35 U.S.C. § 371 National Phase Entry application from PCT/SE2020/050534, filed May 27, 2020, designating the United States, which claims priority to U.S. provisional patent application No. 62/853,403, filed May 28, 2019, the disclosures of which are incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/SE2020/050534 | 5/27/2020 | WO | |

Number | Date | Country |
---|---|---|
62853403 | May 2019 | US |