This invention relates to a learning device, a learning method, and a learning program that performs inverse reinforcement learning.
Reinforcement Learning (RL) is known as one of the machine learning methods. Reinforcement learning is a method to learn an action that maximizes value through trial and error of various actions. In reinforcement learning, a reward function is set to evaluate this value, and the action that maximizes this reward function is explored. However, setting the reward function is generally difficult.
Inverse reinforcement learning (IRL) is known as a method to facilitate the setting of this reward function. In inverse reinforcement learning, the decision-making history data of an expert is used to generate a reward function that reflects the intention of the expert by repeating optimization using the reward function and updating the parameters of the reward function.
Non patent literature 1 describes Maximum Entropy Inverse Reinforcement Learning (ME-IRL), which is a type of inverse reinforcement learning. In ME-IRL, the maximum entropy principle is used to specify the distribution of trajectories and learn the reward function by approaching the true distribution (i.e., maximum likelihood estimation). This solves the indefiniteness of the existence of multiple reward functions that reproduce the trajectory (action history) of an expert.
Non patent literature 2 also describes Guided Cost Learning (GCL), a method of inverse reinforcement learning that improves on maximum entropy inverse reinforcement learning. The method described in Non patent literature 2 uses weighted sampling to update the weights of the reward function.
On the other hand, in the ME-IRL described in Non patent literature 1, it is necessary to calculate the sum of rewards for all possible trajectories during training. However, in reality, it is difficult to calculate the sum of rewards for all trajectories.
To address this issue, the GCL described in Non patent literature 2 calculates this value approximately by weighted sampling. Here, when using weighted sampling with GCL, it is necessary to assume the distribution of the sampling itself. However, there are some problems, such as combinatorial optimization problems, where it is not known how to set the sampling distribution, so the method described in Non patent literature 2 is not applicable to various mathematical optimization problems.
Therefore, it is an exemplary object of the present invention to provide a learning device, a learning method, and a learning program that can perform inverse reinforcement learning applicable to a mathematical optimization problem such as combinatorial optimization, while solving a problem of indefiniteness in inverse reinforcement learning.
A learning device according to the present invention includes: a function input means which accepts input of a reward function whose feature is set to satisfy the Lipschitz continuity condition; an estimation means which estimates a trajectory that minimizes Wasserstein distance, which represents distance between a probability distribution of a trajectory of an expert and a probability distribution of a trajectory determined based on a parameter of the reward function; and an updating means which updates, based on the estimated trajectory, the parameter of the reward function to maximize the log-likelihood of Boltzmann distribution derived from a principle of maximum entropy, wherein the updating means derives, as a lower limit of the log-likelihood, an expression for subtracting, from the Wasserstein distance, an entropy regularization term defined by an expression for the maximum reward value for the parameter minus the average value of reward for the parameter, and updates the parameter of the reward function to maximize the derived lower limit of the log-likelihood.
A learning method according to the present invention includes: accepting input of a reward function whose feature is set to satisfy the Lipschitz continuity condition; estimating a trajectory that minimizes Wasserstein distance, which represents distance between a probability distribution of a trajectory of an expert and a probability distribution of a trajectory determined based on a parameter of the reward function; and updating, based on the estimated trajectory, the parameter of the reward function to maximize the log-likelihood of Boltzmann distribution derived from a principle of maximum entropy, wherein, when updating the parameter, the computer derives, as a lower limit of the log-likelihood, an expression for subtracting, from the Wasserstein distance, an entropy regularization term defined by an expression for the maximum reward value for the parameter minus the average value of reward for the parameter, and updates the parameter of the reward function to maximize the derived lower limit of the log-likelihood.
A learning program according to the present invention causes a computer to execute: function input processing to accept input of a reward function whose feature is set to satisfy the Lipschitz continuity condition; estimation processing to estimate a trajectory that minimizes Wasserstein distance, which represents distance between a probability distribution of a trajectory of an expert and a probability distribution of a trajectory determined based on a parameter of the reward function; and updating processing to update, based on the estimated trajectory, the parameter of the reward function to maximize the log-likelihood of Boltzmann distribution derived from a principle of maximum entropy, wherein in the updating processing, as a lower limit of the log-likelihood, an expression for subtracting, from the Wasserstein distance, an entropy regularization term defined by an expression for the maximum reward value for the parameter minus the average value of reward for the parameter is derived, and the parameter of the reward function is updated to maximize the derived lower limit of the log-likelihood.
The present invention is capable of performing inverse reinforcement learning applicable to a mathematical optimization problem such as combinatorial optimization, while solving a problem of indefiniteness in inverse reinforcement learning.
For ease of understanding, the problem setting, methodology, and issues of maximum entropy inverse reinforcement learning, which is assumed in this exemplary embodiment, are described. In ME-IRL, the following problem setting is assumed. That is, the setting is to estimate a single reward function R(s, a)=θ·f(s, a) from the expert's data D={τ1, τ2, . . . , τN} (where τi=((s1, a1), (s2, a2), . . . , (sN, aN))). In ME-IRL, estimating θ makes it possible to reproduce the decision-making of the expert.
Next, the ME-IRL methodology is described. In ME-IRL, a trajectory τ is represented by Equation 1, illustrated below, and a probability model representing the distribution of trajectories pθ(τ) is represented by Equation 2, illustrated below. θTfτ in Equation 2 represents the reward function (see Equation 3). Also, Z represents the sum of rewards for all trajectories (see Equation 4).
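For reference, one plausible reconstruction of Equations 1 to 4, assuming the standard ME-IRL formulation (the exact expressions in the drawings may differ), is as follows, where f(s, a) is the per-step feature, f_τ is the feature of a trajectory, and T is the trajectory length:

    \tau = ((s_1, a_1), (s_2, a_2), \ldots, (s_T, a_T))                      (cf. Equation 1)
    p_\theta(\tau) = \exp(\theta^{\top} f_\tau) / Z                          (cf. Equation 2)
    \theta^{\top} f_\tau = \sum_{(s, a) \in \tau} \theta^{\top} f(s, a)      (cf. Equation 3)
    Z = \sum_{\tau'} \exp(\theta^{\top} f_{\tau'})                           (cf. Equation 4)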
The update rule of the reward function weights by maximum likelihood estimation (specifically, the gradient ascent method) is then represented by Equations 5 and 6, which are illustrated below. In Equation 5, α is the step width, and L(θ) is the distance measure between distributions used in ME-IRL.
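Under the same assumption, Equations 5 and 6 would take the following form, in which the first term of L(θ) is the average reward of the expert's trajectories and the second term is the logarithm of the sum of exponentiated rewards over all possible trajectories (again, a plausible reconstruction; the drawings may differ):

    \theta \leftarrow \theta + \alpha \, \partial L(\theta) / \partial \theta                                               (cf. Equation 5)
    L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \theta^{\top} f_{\tau_E^{(n)}} - \log \sum_{\tau} \exp(\theta^{\top} f_{\tau})   (cf. Equation 6)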
The second term in Equation 6 is the sum of rewards for all trajectories. ME-IRL assumes that the value of this second term can be calculated exactly. However, in reality, it is difficult to calculate the sum of rewards for all trajectories. The above is the problem setting, methodology, and issues of ME-IRL.
The exemplary embodiment will be described below with reference to the drawings.
Since the mathematical optimization execution unit 50, the weight updating unit 60, and the convergence decision unit 70 perform the inverse reinforcement learning described below, the device including the mathematical optimization execution unit 50, the weight updating unit 60, and the convergence decision unit 70 can be called an inverse reinforcement learning device.
The storage unit 10 stores information necessary for the learning device 100 to perform various processes. The storage unit 10 may store decision-making history data (trajectory) of an expert that is accepted by the input unit 20 described below. The storage unit 10 may also store candidate features of the reward function to be used for learning by the mathematical optimization execution unit 50 and the weight updating unit 60, which will be described later. However, the candidate feature need not necessarily be the feature used for the objective function.
The storage unit 10 may also store a mathematical optimization solver to realize the mathematical optimization execution unit 50 described below. The content of the mathematical optimization solver is arbitrary and may be determined according to the environment or device in which it is to be executed.
The input unit 20 accepts input of information necessary for the learning device 100 to perform various processes. For example, the input unit 20 may accept input of the decision-making history data of an expert (specifically, state and action pairs) described above. The input unit 20 may also accept input of initial states and constraints used by the inverse reinforcement learning device to perform inverse reinforcement learning, as described below.
The feature setting unit 30 sets features of the reward function from data including state and action. Specifically, in order for the inverse reinforcement learning device described below to be able to use the Wasserstein distance as a distance measure between distributions, the feature setting unit 30 sets the features of the reward function so that the slope of the tangent is finite over the entire function. The feature setting unit 30 may, for example, set the features of the reward function to satisfy the Lipschitz continuity condition.
For example, let fτ be the feature vector of a trajectory τ. In the linear case of the reward function θTfτ, if the mapping F: τ→fτ is Lipschitz continuous, then θTfτ is also Lipschitz continuous. Therefore, the feature setting unit 30 may set the features so that the reward function is a linear function.
For example, Equation 7, illustrated below, is an inappropriate reward function for this disclosure because the gradient becomes infinite at a0.
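As a minimal illustration (a sketch under the assumption that a trajectory is represented by a list of bounded per-step feature vectors; the function names below are illustrative and not part of this disclosure), a linear reward of the form θTfτ can be written as:

    import numpy as np

    def trajectory_feature(step_features):
        # step_features: list of per-step feature vectors f(s, a) for one trajectory
        # (illustrative representation). Summing bounded per-step features keeps the
        # slope of the mapping tau -> f_tau finite.
        return np.sum(np.asarray(step_features, dtype=float), axis=0)

    def linear_reward(theta, step_features):
        # Linear reward theta^T f_tau; it is Lipschitz continuous whenever the
        # feature mapping tau -> f_tau is Lipschitz continuous.
        return float(np.dot(theta, trajectory_feature(step_features)))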
The feature setting unit 30 may, for example, determine the reward function with features set according to user instructions, or may retrieve a reward function that satisfies the Lipschitz continuity condition from the storage unit 10.
The weight initial value setting unit 40 initializes the weights of the reward function. Specifically, the weight initial value setting unit 40 sets the weights of individual features included in the reward function. The method of initializing the weights is not particularly limited, and the weights may be initialized based on any predetermined method according to the user or other factors.
The mathematical optimization execution unit 50 derives a trajectory τ̂ (τ with a circumflex) that minimizes the distance between the probability distribution of the expert's trajectory (action history) and the probability distribution of the trajectory determined based on the parameters of the reward function being optimized. Specifically, the mathematical optimization execution unit 50 estimates the expert's trajectory τ̂ by using the Wasserstein distance as the distance measure between the distributions and executing mathematical optimization to minimize the Wasserstein distance.
The Wasserstein distance is defined by Equation 8, illustrated below. In other words, the Wasserstein distance represents the distance between the probability distribution of the expert's trajectories and the probability distribution of trajectories determined based on the parameters of the reward function. Note that due to the constraint of the Wasserstein distance, the reward function θTfτ must be a function that satisfies the Lipschitz continuity condition. On the other hand, in this exemplary embodiment, since the features of the reward function are set to satisfy the Lipschitz continuity condition by the feature setting unit 30, the mathematical optimization execution unit 50 is able to use the Wasserstein distance as illustrated below.
The Wasserstein distance defined in Equation 8, illustrated above, takes values less than or equal to zero, and increasing this value corresponds to bringing the distributions closer together. In the second term of Equation 8, τθ(n) represents the n-th trajectory optimized by the parameter θ. The second term in Equation 8 is a term that can also be calculated in a combinatorial optimization problem. Therefore, by using the Wasserstein distance illustrated in Equation 8 as a distance measure between distributions, inverse reinforcement learning applicable to mathematical optimization problems such as combinatorial optimization can be performed.
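One plausible reconstruction of Equation 8, assuming the Kantorovich-Rubinstein dual form with the reward function as the Lipschitz potential (the exact expression in the drawings may differ), is:

    W(\theta) = \frac{1}{N} \sum_{n=1}^{N} \theta^{\top} f_{\tau_E^{(n)}} - \frac{1}{N} \sum_{n=1}^{N} \theta^{\top} f_{\tau_\theta^{(n)}}   (cf. Equation 8)

Here, τ_E(n) is the n-th expert trajectory and τ_θ(n) is the n-th trajectory obtained by mathematical optimization under the current parameter θ. Because the optimized trajectories attain rewards at least as high as the expert's, this quantity is less than or equal to zero, and raising it toward zero brings the two distributions closer, consistent with the description above.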
The weight updating unit 60 updates the parameter θ of the reward function to maximize the distance measure between distributions based on the estimated expert's trajectory τ̂. Here, in maximum entropy inverse reinforcement learning (i.e., ME-IRL), the trajectory τ is assumed to follow a Boltzmann distribution by the maximum entropy principle. Therefore, as in ME-IRL, the weight updating unit 60 updates the parameter θ of the reward function to maximize the log-likelihood of the Boltzmann distribution derived by the maximum entropy principle based on the estimated expert's trajectory τ̂, as illustrated in Equations 5 and 6 above.
In updating, the weight updating unit 60 in this exemplary embodiment derives an upper bound of the log sum exponential (hereinafter referred to as logSumExp) for the second term in Equation 6 (i.e., the sum of the rewards for all trajectories). In other words, the weight updating unit 60 derives the lower limit L̲(θ) (L̲ denotes L with an underbar) of the distance measure between the distributions used in ME-IRL, as in Equation 9 below. The derived expression is sometimes hereafter referred to simply as the lower limit of the log-likelihood.
The second term in Equation 9, which represents the lower bound of the log-likelihood, is the maximum reward value for the current parameter θ, and the third term is the log value (logarithmic value) of the number of trajectories (Nτ) that can be taken. Thus, based on the log-likelihood of ME-IRL, the weight updating unit 60 derives the lower bound of the log-likelihood, which is calculated by subtracting, from the probability distribution of trajectories, the maximum reward value for the current parameter θ and the log value (logarithmic value) of the number of trajectories (Nτ) that can be taken.
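Concretely, using the upper bound of logSumExp, log Σ_τ exp(θ^⊤ f_τ) ≤ max_τ θ^⊤ f_τ + log N_τ, one plausible form of Equation 9 (the exact expression in the drawings may differ) is:

    \underline{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \theta^{\top} f_{\tau_E^{(n)}} - \max_{\tau} \theta^{\top} f_{\tau} - \log N_\tau   (cf. Equation 9)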
In addition, the weight updating unit 60 transforms the equation for the lower bound of the derived ME-IRL log-likelihood into an equation that subtracts the entropy regularization term from the Wasserstein distance. An equation obtained by decomposing the expression for the lower bound of the log-likelihood of ME-IRL into the Wasserstein distance and the entropy regularization term is expressed as Equation 10 illustrated below.
The expression in the first parenthesis in Equation 10 represents the Wasserstein distance, as in Equation 8 above. The expression in the second parenthesis in Equation 10 represents the entropy regularization term that contributes to the increase in the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle. Specifically, in the entropy regularization term illustrated in Equation 10 (i.e., the equation in the second parenthesis in Equation 10), the first term represents the maximum reward value for the current parameter θ, and the second term represents the average value of the reward for the current parameter θ.
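Under the same assumptions, Equation 10 would decompose the lower bound as follows (a plausible reconstruction; the drawings may differ), where the first parenthesis is the Wasserstein distance of Equation 8, the second parenthesis is the entropy regularization term, and the final term −log N_τ is constant with respect to θ:

    \underline{L}(\theta) = \left( \frac{1}{N} \sum_{n} \theta^{\top} f_{\tau_E^{(n)}} - \frac{1}{N} \sum_{n} \theta^{\top} f_{\tau_\theta^{(n)}} \right) - \left( \max_{\tau} \theta^{\top} f_{\tau} - \frac{1}{N} \sum_{n} \theta^{\top} f_{\tau_\theta^{(n)}} \right) - \log N_\tau   (cf. Equation 10)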
The reason why the expression in this second parenthesis functions as an entropy regularization term is explained below. In order to maximize the lower bound of the log-likelihood of ME-IRL, the value of this expression must be made smaller, which corresponds to a smaller difference between the maximum reward value and the average value. A smaller difference between the maximum reward value and the average value indicates a smaller variation in the trajectories.
In other words, a smaller difference between the maximum reward value and the average value means an increase in entropy, which means that entropy regularization works and contributes to entropy maximization. This contributes to maximizing the log-likelihood of the Boltzmann distribution, which in turn contributes to resolving indeterminacy in inverse reinforcement learning.
The weight updating unit 60 updates the parameter θ using the gradient ascent method based on Equation 10 illustrated above, for example, while fixing the estimated trajectory τ̂. However, the value may not converge with the usual gradient ascent method. In the entropy regularization term, the feature of the trajectory that takes the maximum reward value (fτθmax) does not match the average value of the features of the other trajectories (fτ(n)) (i.e., the difference between them is not zero). Therefore, the usual gradient ascent method is not stable because the log-likelihood oscillates and does not converge, making it difficult to make a proper convergence decision (see Equation 11 below).
Therefore, when using the gradient method, the weight updating unit 60 in this exemplary embodiment may update the parameter θ so that the portion contributing to entropy regularization (i.e., the portion corresponding to the entropy regularization term) is gradually attenuated. Specifically, the weight updating unit 60 defines an updating equation in which the portion contributing to entropy regularization has an attenuation coefficient βt that indicates the degree of attenuation. For example, the weight updating unit 60 differentiates the above Equation 10 with respect to θ and defines Equation 12, illustrated below, in which, of the portion corresponding to the term indicating the Wasserstein distance (i.e., the portion contributing to increasing the Wasserstein distance) and the portion corresponding to the entropy regularization term, the attenuation coefficient is set on the portion corresponding to the entropy regularization term.
The attenuation coefficients are predefined according to the method of attenuating the portion corresponding to the entropy regularization term. For example, for smooth attenuation, βt is defined as in Equation 13, illustrated below.
In Equation 13, β1 is set to 1 and β2 is set to 0 or greater. Also, t indicates the number of iterations. This makes the attenuation coefficient βt act as a coefficient that decreases the portion corresponding to the entropy regularization term as the number of iterations t increases.
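The following is a minimal Python sketch of one gradient-ascent step in the spirit of Equations 12 and 13; the decay schedule βt = β1/(1 + β2·t) and all function and variable names are assumptions for illustration, and the exact expressions in the drawings may differ.

    import numpy as np

    def update_theta(theta, f_expert, f_optimized, f_max, step_width, beta1, beta2, t):
        # One gradient-ascent step (a sketch in the spirit of Equation 12).
        # f_expert:    (N, d) array of expert trajectory features
        # f_optimized: (N, d) array of features of trajectories optimized under theta
        # f_max:       (d,) feature of the trajectory with the maximum reward
        wasserstein_part = f_expert.mean(axis=0) - f_optimized.mean(axis=0)
        entropy_reg_part = f_max - f_optimized.mean(axis=0)
        # Assumed smooth decay, one possible reading of Equation 13.
        beta_t = beta1 / (1.0 + beta2 * t)
        return theta + step_width * (wasserstein_part - beta_t * entropy_reg_part)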
Since the Wasserstein distance induces a weaker topology than the log-likelihood, which corresponds to the KL divergence, bringing the log-likelihood close to 0 also brings the Wasserstein distance close to 0. Therefore, the weight updating unit 60 may update the parameter θ without attenuating the portion corresponding to the entropy regularization term in the initial stage of the update, and update the parameter θ to reduce the effect of the portion corresponding to the entropy regularization term at the timing when the log-likelihood begins to oscillate.
Specifically, the weight updating unit 60 updates the parameter θ with the attenuation coefficient βt=1 initially, using Equation 12 illustrated above. The weight updating unit 60 may then update the parameter θ by changing the attenuation coefficient to βt=0 at the timing when the log-likelihood begins to oscillate, thereby eliminating the effect of the portion corresponding to the entropy regularization term.
For example, the weight updating unit 60 may determine that the log-likelihood has begun to oscillate when the moving average of the log-likelihood becomes constant. Specifically, the weight updating unit 60 may determine that the moving average has become constant when the change in the moving average in the time window (several points in the past from the current value) of the "lower bound of log-likelihood" is very small (e.g., less than 1e−3).
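A minimal sketch of this oscillation check, assuming the lower-bound values are collected in a list and the window size is chosen by the user (both are illustrative assumptions), could look as follows.

    import numpy as np

    def moving_average_is_constant(lower_bound_history, window=10, tol=1e-3):
        # Judge oscillation onset: the change of the moving average of the
        # "lower bound of log-likelihood" over a recent time window is very small.
        if len(lower_bound_history) < 2 * window:
            return False
        recent = float(np.mean(lower_bound_history[-window:]))
        previous = float(np.mean(lower_bound_history[-2 * window:-window]))
        return abs(recent - previous) < tol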
At the timing when the log-likelihood begins to oscillate, the weight updating unit 60 may first change the attenuation coefficient as illustrated above in Equation 13, instead of suddenly setting the attenuation coefficient βt=0. Then, the weight updating unit 60 may change the attenuation coefficient to βt=0 at the timing when the log-likelihood begins to oscillate further after the change. The method for determining the timing at which the oscillations begin to occur is the same as the method described above.
Furthermore, the weight updating unit 60 may change the updating method of the parameter θ at the timing when the log-likelihood begins to oscillate further after the attenuation coefficient is changed as illustrated in Equation 13 above. Specifically, the weight updating unit 60 may update the parameter θ using the momentum method as illustrated in Equation 14 below. The values of γ1 and α in Equation 14 are predetermined. For example, γ1=0.9 and α=0.001 may be defined.
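A minimal sketch of such a momentum update, assuming one common form of the momentum method (the exact Equation 14 may differ), is:

    def momentum_update(theta, velocity, gradient, gamma1=0.9, alpha=0.001):
        # A common momentum form: past gradients are accumulated in the velocity,
        # which damps the oscillation of the update.
        velocity = gamma1 * velocity + alpha * gradient
        return theta + velocity, velocity

For example, theta, velocity = momentum_update(theta, velocity, gradient) would be called in place of the plain gradient-ascent update once further oscillation is detected.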
Thereafter, the trajectory estimation process by the mathematical optimization execution unit 50 and the updating process of the parameter θ by the weight updating unit 60 are repeated until the lower bound of the log-likelihood is judged to have converged by the convergence decision unit 70 described below.
The convergence decision unit 70 determines whether the distance measure between distributions has converged. Specifically, the convergence decision unit 70 determines whether the lower limit of the log-likelihood has converged. The determination method is arbitrary. For example, the convergence decision unit 70 may determine that the distance measure between distributions has converged when the absolute value of the lower limit of the log-likelihood becomes smaller than a predetermined threshold value.
When the convergence decision unit 70 determines that the distance measures between distributions have not converged, the convergence decision unit 70 continues the processing by the mathematical optimization execution unit 50 and the weight updating unit 60. On the other hand, when the convergence decision unit 70 determines that the distance measures between distributions have converged, the convergence decision unit 70 terminates the processing by the mathematical optimization execution unit 50 and the weight updating unit 60.
The output unit 80 outputs the learned reward function.
The input unit 20, the feature setting unit 30, the weight initial value setting unit 40, the mathematical optimization execution unit 50, the weight updating unit 60, the convergence decision unit 70, and the output unit 80 are realized by a processor (for example, CPU (Central Processing Unit)) of a computer that operates according to a program (learning program).
For example, a program may be stored in the storage unit 10 provided in the learning device 100, and the processor may read the program and operate as the input unit 20, the feature setting unit 30, the weight initial value setting unit 40, the mathematical optimization execution unit 50, the weight updating unit 60, the convergence decision unit 70, and the output unit 80 according to the program. In addition, the functions of the learning device 100 may be provided in the form of SaaS (Software as a Service).
The input unit 20, the feature setting unit 30, the weight initial value setting unit 40, the mathematical optimization execution unit 50, the weight updating unit 60, the convergence decision unit 70, and the output unit 80 may each be realized by dedicated hardware. Some or all of the components of each device may be realized by general-purpose or dedicated circuits, a processor, or combinations thereof. These may be configured by a single chip or by multiple chips connected through a bus. Some or all of the components of each device may be realized by a combination of the above-mentioned circuits, etc., and a program.
When some or all of the components of the learning device 100 are realized by multiple information processing devices, circuits, etc., the multiple information processing devices, circuits, etc. may be centrally located or distributed. For example, the information processing devices, circuits, etc. may be realized as a client-server system, a cloud computing system, etc., each of which is connected through a communication network.
Next, the operation example of this exemplary embodiment of the learning device 100 will be described.
The mathematical optimization execution unit 50 accepts input of the reward function whose feature is set to satisfy the Lipschitz continuity condition (step S14). Then, the mathematical optimization execution unit 50 executes mathematical optimization to minimize Wasserstein distance (step S15). Specifically, the mathematical optimization execution unit 50 estimates the trajectory that minimizes the Wasserstein distance, which represents the distance between a probability distribution of a trajectory of the expert and a probability distribution of a trajectory determined based on the parameter of the reward function.
The weight updating unit 60 updates the parameter of the reward function so as to maximize the log-likelihood of Boltzmann distribution based on the estimated trajectory (step S16). In this case, the weight updating unit 60 derives a lower bound of the log-likelihood and updates the parameter of the reward function so as to maximize the derived lower bound of the log-likelihood.
The convergence decision unit 70 determines whether the lower bound of the log-likelihood has converged or not (Step S17). If it is determined that the lower bound of the log-likelihood has not converged (No in step S17), the process from step S15 is repeated using the updated trajectory. On the other hand, if it is determined that the lower bound of the log-likelihood has converged (Yes in step S17), the output unit 80 outputs the learned reward function (step S18).
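For reference, the overall flow of steps S14 to S18 could be sketched as follows; optimize_trajectories stands in for the mathematical optimization solver and, like the step width and threshold, is an illustrative assumption rather than part of this disclosure.

    import numpy as np

    def learn_reward_weights(theta_init, expert_features, optimize_trajectories,
                             step_width=0.01, max_iterations=1000, threshold=1e-3):
        # Sketch of steps S14 to S18. optimize_trajectories(theta) is assumed to
        # call a mathematical optimization solver and return
        # (optimized_features, max_feature, num_trajectories) under the current theta.
        theta = np.asarray(theta_init, dtype=float)
        for _ in range(max_iterations):
            # Step S15: estimate trajectories that minimize the Wasserstein distance.
            optimized_features, max_feature, num_trajectories = optimize_trajectories(theta)
            # Lower bound of the log-likelihood (cf. Equations 9 and 10).
            lower_bound = (expert_features.mean(axis=0) @ theta
                           - max_feature @ theta
                           - np.log(num_trajectories))
            # Step S16: update the parameter to maximize the lower bound
            # (no attenuation coefficient in this simplified sketch).
            gradient = ((expert_features.mean(axis=0) - optimized_features.mean(axis=0))
                        - (max_feature - optimized_features.mean(axis=0)))
            theta = theta + step_width * gradient
            # Step S17: one possible convergence decision on the lower bound.
            if abs(lower_bound) < threshold:
                break
        return theta  # Step S18: the learned reward weights are output.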
As described above, in this exemplary embodiment, the mathematical optimization execution unit 50 accepts input of a reward function whose feature is set to satisfy Lipschitz continuity condition and estimates a trajectory that minimizes Wasserstein distance, which represents distance between a probability distribution of a trajectory of an expert and a probability distribution of a trajectory determined based on a parameter of the reward function. Then, the weight updating unit 60 updates the parameter of the reward function to maximize the log-likelihood of Boltzmann distribution based on the estimated trajectory. Specifically, the weight updating unit 60 derives an expression that subtracts the entropy regularization term from the Wasserstein distance as a lower bound of the log-likelihood, and updates the parameter of the reward function so that the derived lower bound of the log-likelihood is maximized. Thus, while solving a problem of indefiniteness in inverse reinforcement learning, inverse reinforcement learning can be applied to mathematical optimization problems such as combinatorial optimization.
For example, maximum entropy inverse reinforcement learning solves the indefiniteness of the existence of multiple reward functions, but adequate results can be obtained only in situations where all trajectories can be calculated. In contrast, the method of sampling trajectories leaves the difficulty of having to set up a sampling distribution. Combinatorial optimization is an optimization problem that takes discrete values (in other words, values that are not continuous), making it difficult to set up a probability distribution that returns the probability corresponding to a value when a certain value is input. This is because, in a combinatorial optimization problem, if the value in the objective function changes even slightly, the result may also change significantly.
On the other hand, the learning device 100 (weight updating unit 60) of this exemplary embodiment derives the lower bound of the log-likelihood for maximum entropy inverse reinforcement learning, which is decomposed into the Wasserstein distance and the entropy regularization term. The learning device 100 then updates the parameter of the reward function to maximize the derived lower bound of the log-likelihood. Thus, the indefiniteness in inverse reinforcement learning can be resolved, and since the sampling distribution does not need to be set, the method can be applied to various mathematical optimization problems, especially combinatorial optimization.
For example, typical examples of combinatorial optimization problems include routing problems, scheduling problems, cut-and-pack problems, and assignment and matching problems. Specifically, the routing problem is, for example, the transportation routing problem or the traveling salesman problem, and the scheduling problem is, for example, the job shop problem or the work schedule problem. The cut-and-pack problem is, for example, the knapsack problem or the bin packing problem, and the assignment and matching problem is, for example, the maximum matching problem or the generalized assignment problem.
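As a concrete, purely illustrative example, if a trajectory corresponds to a knapsack selection, the reward-maximizing trajectory under the current parameter θ could be obtained by solving a small knapsack instance; exhaustive search is used here only for clarity, and θ and the item data are assumed to be numeric arrays.

    from itertools import combinations
    import numpy as np

    def best_knapsack_selection(theta, item_features, item_weights, capacity):
        # Search a small knapsack instance for the selection (trajectory) that
        # maximizes the linear reward theta^T f under the capacity constraint.
        n = len(item_features)
        best_value, best_selection = -np.inf, ()
        for r in range(n + 1):
            for selection in combinations(range(n), r):
                if sum(item_weights[i] for i in selection) > capacity:
                    continue
                if selection:
                    feature = np.sum([item_features[i] for i in selection], axis=0)
                else:
                    feature = np.zeros_like(np.asarray(item_features[0], dtype=float))
                value = float(np.dot(theta, feature))
                if value > best_value:
                    best_value, best_selection = value, selection
        return best_selection, best_value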
Next, a specific example of a robot control system using the learning device 100 of this exemplary embodiment will be described.
The learning device 100 illustrated in
The training data storage unit 2200 stores training data used by the learning device 100 for learning. The training data storage unit 2200 may, for example, store decision-making history data of an expert.
The robot 2300 is a device that operates based on a reward function. The robot here is not limited to a device shaped to resemble a human or an animal, but also includes a device that performs automatic tasks (automatic operation, automatic control, etc.). The robot 2300 includes a storage unit 2310, an input unit 2320, and a control unit 2330.
The storage unit 2310 stores the reward function learned by the learning device 100.
The input unit 2320 accepts input of data indicating the state of the robot in operation.
The control unit 2330 determines actions to be performed by the robot 2300 based on the received (state-indicating) data and the reward function stored in the storage unit 2310. The method in which the control unit 2330 determines the control action based on the reward function is widely known, and a detailed explanation is omitted here. In this exemplary embodiment, a device such as the robot 2300, which performs automatic tasks, can be controlled based on a reward function that reflects the intention of an expert.
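A minimal sketch of such reward-based control, assuming a greedy one-step selection over candidate actions with a feature function shared with training (all names are illustrative; an actual controller may plan over longer horizons), is:

    import numpy as np

    def select_action(theta, state, candidate_actions, feature):
        # Choose the candidate action whose features give the highest learned
        # reward theta^T f(s, a).
        rewards = [float(np.dot(theta, feature(state, action)))
                   for action in candidate_actions]
        return candidate_actions[int(np.argmax(rewards))]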
Next, an overview of the present invention will be described.
The updating means 93 derives, as a lower limit of the log-likelihood, an expression for subtracting, from the Wasserstein distance, an entropy regularization term defined by an expression for the maximum reward value for the parameter minus the average value of reward for the parameter, and updates the parameter of the reward function to maximize the derived lower limit of the log-likelihood.
Such a configuration allows inverse reinforcement learning to solve the problem of indefiniteness in inverse reinforcement learning while also being applicable to mathematical optimization problems such as combinatorial optimization.
The updating means 93 may set, to the entropy regularization term, an attenuation coefficient (e.g., βt) that attenuates the degree to which the portion (e.g., the expression in the second parenthesis of Equation 12) corresponding to the entropy regularization term (e.g., the expression in the second parenthesis of Equation 10) contributes to maximizing the lower limit of the log-likelihood as the process of updating the parameter is repeated, and update the parameter of the reward function to maximize the lower limit of the log-likelihood in which the attenuation coefficient is set.
On the other hand, the updating means 93 may set an attenuation coefficient (e.g., βt) that attenuates the degree to which the entropy regularization term contributes to maximizing the lower limit of the log-likelihood to the portion corresponding to the entropy regularization term, and, in the course of repeating the process of updating the parameter, change the attenuation coefficient to attenuate the degree to which the portion corresponding to the entropy regularization term contributes to maximizing the lower limit of the log-likelihood (e.g., from βt=1 to βt=0, or from βt=1 to βt illustrated in Equation 13 above).
Specifically, the updating means 93 may change the attenuation coefficient when it is determined that the moving average of the log-likelihood has become constant (e.g., the change in the moving average is very small).
The updating means 93 may derive the lower bound for the log-likelihood based on an upper bound of a log sum exponential.
The function input means 91 may accept input of the reward function whose feature is set to be a linear function.
The learning device 90 described above is implemented in the computer 1000. Then, the operation of each processing unit described above is stored in the auxiliary storage device 1003 in the form of a program (learning program). The processor 1001 reads the program from the auxiliary storage device 1003, develops the program in the main storage device 1002, and executes the above processing according to the program.
Note that, in at least one exemplary embodiment, the auxiliary storage device 1003 is an example of a non-transitory tangible medium. Other examples of the non-transitory tangible medium include a magnetic disk, a magneto-optical disk, a compact disc read-only memory (CD-ROM), a digital versatile disk (DVD)-ROM, a semiconductor memory, and the like connected via the interface 1004. Furthermore, in a case where the program is distributed to the computer 1000 via a communication line, the computer 1000 that has received the program may develop the program in the main storage device 1002 and execute the above processing.
Furthermore, the program may be for implementing some of the functions described above. In addition, the program may be a program that implements the above-described functions in combination with another program already stored in the auxiliary storage device 1003, a so-called difference file (difference program).
Some or all of the above exemplary embodiments may also be described as in the following Supplementary notes, but are not limited to the following.
(Supplementary note 1) A learning device comprising:
(Supplementary note 2) The learning device according to Supplementary note 1, wherein
(Supplementary note 3) The learning device according to Supplementary note 1, wherein
(Supplementary note 4) The learning device according to Supplementary note 3, wherein
(Supplementary note 5) The learning device according to any one of Supplementary notes 1 to 4, wherein
(Supplementary note 6) The learning device according to any one of Supplementary notes 1 to 5, wherein
(Supplementary note 7) A learning method for a computer comprising:
(Supplementary note 8) The learning method according to Supplementary note 7, wherein
(Supplementary note 9) The learning method according to Supplementary note 7, wherein
(Supplementary note 10) A program storage medium which stores a learning program for causing a computer to execute:
(Supplementary note 11) The program storage medium according to Supplementary note 10, wherein
(Supplementary note 12) The program storage medium according to Supplementary note 10, wherein
(Supplementary note 13) A learning program for causing a computer to execute:
(Supplementary note 14) The learning program according to Supplementary note 13, wherein
(Supplementary note 15) The learning program according to Supplementary note 13, wherein
Filing Document: PCT/JP2021/016630; Filing Date: 4/26/2021; Country: WO