The present invention relates to an information processing apparatus, an information processing method, and an information processing program for providing a solution to an online optimization problem.
An algorithm for carrying out optimization of an indicator under a condition in which a function expressing the indicator to be optimized can sequentially change (such optimization is also called online optimization) is known (e.g., Non-Patent Literature 1).
In a method disclosed in Non-Patent Literature 1, subsets X1, X2, . . . , XT are derived such that an expected value of the regret Σt∈[T]ft(Xt)−minX∈S{Σt∈[T]ft(X)} is bounded from above by O(n√T).
Meanwhile, online optimization problems are known to be roughly classified into two models below, and a suitable optimization algorithm can vary depending on which one of the models is assumed.
However, in an online optimization problem, it is generally difficult to acquire a priori information pertaining to which one of a stochastic model and an adversarial model should be assumed. Under the circumstances, a technique has been demanded which makes it possible to suitably provide a solution to an optimization problem with respect to both a stochastic model and an adversarial model without referring to a priori information pertaining to which one of the stochastic model and the adversarial model should be assumed.
An example aspect of the present invention is accomplished in view of the above problem, and an example object thereof is to realize a technique which makes it possible to suitably provide a solution to an optimization problem with respect to both a stochastic model and an adversarial model without referring to a priori information pertaining to which one of the stochastic model and the adversarial model should be assumed.
An information processing apparatus in accordance with an example aspect of the present invention includes at least one processor, the at least one processor carrying out: a selection process of selecting a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1; and an output process of outputting information indicating the subset Xt⊆[n] which has been selected in the selection process, in the selection process, the at least one processor selecting the subset Xt⊆[n] so that an asymptotic behavior of an expected value of a regret Σt∈[T]ft(Xt)−Σt∈[T]ft(X*), which is expressed using an observation value ft(Xt) of an objective function in each round t∈[T] and a comparative solution X*, is bounded from above by an upper limit value A(Δ,n,C) which depends at least on a gap indicator Δ in a stochastic model and on a corruption indicator C indicating an adversarial corruption of the stochastic model.
An information processing apparatus in accordance with an example aspect of the present invention includes at least one processor, the at least one processor carrying out: a selection process of selecting a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1; and an output process of outputting information indicating the subset Xt⊆[n] which has been selected in the selection process, in the selection process, in each round, the at least one processor calculating, for each i∈[n], an n-dimensional vector xt∈[0,1]n by
using a learning rate λti and a cumulative subgradient Gti, the at least one processor calculating, for all i∈[n−1], a permutation σt:[n]→[n] where xtσ(i)≤xtσ(i+1), the at least one processor deciding values of a random variable ut which are uniformly distributed on [0,1], the at least one processor deciding a subset Xt so that Xt={i∈[n]|xti≥ut} is satisfied, the at least one processor acquiring a value of an objective function ft(X), the at least one processor calculating a subgradient gt∈Rn of an objective function ft, and the at least one processor updating a cumulative subgradient Gt by Gt+1=Gt+gt.
An information processing method in accordance with an example aspect of the present invention includes: selecting a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1; and outputting information indicating the subset Xt⊆[n] which has been selected, in the selecting, the subset Xt⊆[n] being selected so that an asymptotic behavior of an expected value of a regret Σt∈[T]ft(Xt)−Σt∈[T]ft(X*), which is expressed using an observation value ft(Xt) of an objective function in each round t∈[T] and an optimum solution X*, is bounded from above by an upper limit value A(Δ,n,C) which depends at least on a gap indicator Δ in a stochastic model and on a corruption indicator C indicating an adversarial corruption of the stochastic model.
A computer-readable non-transitory storage medium in accordance with an example aspect of the present invention is a computer-readable non-transitory storage medium storing a program for causing a computer to function as the information processing apparatus described above, the program causing the computer to carry out the selection process and the output process.
According to an example aspect of the present invention, it is possible to suitably provide a solution to an optimization problem with respect to both a stochastic model and an adversarial model without referring to a priori information pertaining to which one of the stochastic model and the adversarial model should be assumed.
The following description will discuss a first example embodiment of the present invention in detail, with reference to the drawings. The present example embodiment is a basic form of example embodiments described later.
An information processing apparatus 1 in accordance with the present example embodiment is, schematically speaking, an information processing apparatus that carries out optimization (also called online optimization) of an indicator under a condition in which a function expressing the indicator to be optimized can sequentially change. In other words, the information processing apparatus 1 is an information processing apparatus that provides a solution to an online optimization problem.
For example, the information processing apparatus 1 decides a certain action in a certain round and acquires an observation value related to a result obtained by the certain action. Then, the process of deciding, with reference to the observation value, an action to be carried out in the next round is repeated. Here, it is assumed that a relation between an action and a result is expressed by, for example, an unknown objective function in which the action is an argument and the result is a function value. Therefore, by referring to an action decided in a certain round and a result obtained by the action, the information processing apparatus 1 acquires (local) information pertaining to the objective function.
Examples of the foregoing action include price setting of one or more products. Examples of the foregoing observation value related to the result include an actual amount of sales obtained by the price setting, and an actual amount of loss (calculated by subtracting an actual amount of sales from a target amount of sales) caused by the price setting. Note, however, that the present example embodiment is not limited to these examples. Examples of the foregoing indicator to be optimized include: a regret that is a difference (or an expected value thereof) between a sum total of actual amounts of sales and a sum total of amounts of sales in a case where an ideal (optimum) action has been taken; a regret that is a difference (or an expected value thereof) between a sum total of actual amounts of loss and a sum total of amounts of loss in a case where an ideal (optimum) action has been taken; and the like. Note, however, that the present example embodiment is not limited to these examples.
In the present example embodiment, an online optimization problem to be solved by the information processing apparatus 1 is positioned, for example, as follows.
An online optimization problem to be solved by the information processing apparatus 1 is, for example, an online optimization problem that is classified into a combination set (family of subsets). Here, in the online optimization problem that is classified into a combination set, a weighted sum or submodularity is assumed as a property of an objective function. In addition, in the online optimization problem that is classified into a combination set, a combination set having a plurality of elements is targeted. Therefore, for example, it is possible to handle a combination of prices of a plurality of products.
An online optimization problem to be solved by the information processing apparatus 1 is roughly classified into the following two settings. Here, a full-information setting has a greater amount of feedback information pertaining to an objective function than a bandit-feedback setting.
Online optimization problems are roughly classified into two models below.
As described later, the information processing apparatus 1 is configured so that an adversarial corruption with respect to a stochastic model (stochastic model with adversarial corruption) can be quantitatively evaluated using a corruption indicator C. The information processing apparatus 1 carries out an algorithm that is applicable to all of a stochastic model, an adversarial model, and an adversarial corruption with respect to a stochastic model.
Therefore, the information processing apparatus 1 can suitably solve an optimization problem with respect to both a stochastic model and an adversarial model by a hybrid algorithm (best of both worlds algorithm) which is suitably applicable to both the stochastic model and the adversarial model without referring to a priori information pertaining to which one of the stochastic model and the adversarial model should be assumed.
Next, the following description will discuss a configuration of the information processing apparatus 1 in accordance with the present example embodiment.
The selection section 11 selects a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1.
The output section 12 outputs information indicating the subset Xt⊆[n] which has been selected by the selection section 11.
Here, the subset Xt is a subset that defines an action in the round t. For example, the subset Xt has a meaning as a subset including identification information of a product as an element. Note, however, that the present example embodiment is not limited to this. Examples of the foregoing objective function include an objective function that defines a relation between an action and an amount of sales, and an objective function that defines a relation between an action and an amount of loss. Note, however, that the present example embodiment is not limited to these examples.
Each element of the set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in accordance with the present example embodiment corresponds one-to-one to one of the n elements constituting an arbitrary set S. Therefore, a process described in the present example embodiment can be applied to an arbitrary set S consisting of n elements.
The selection section 11 in accordance with the present example embodiment selects the subset Xt⊆[n] so that an asymptotic behavior of an expected value of a regret Σt∈[T]ft(Xt)−Σt∈[T]ft(X*), which is expressed using an observation value ft(Xt) of an objective function in each round t∈[T] and a comparative solution X*, is bounded from above by an upper limit value A(Δ,n,C) which depends at least on a gap indicator Δ in a stochastic model and on a corruption indicator C indicating an adversarial corruption of the stochastic model. Note that the comparative solution is, for example, an optimum solution that minimizes an objective function expressing a loss.
In a case where an objective function expressing an amount of sales is used, a function in which the sign of the above-described regret is reversed may be used. In other words, the selection can equivalently be expressed as selecting the subset Xt⊆[n] so that an asymptotic behavior of an absolute value of an expected value of the regret Σt∈[T]ft(Xt)−Σt∈[T]ft(X*) is bounded from above by an upper limit value A(Δ,n,C) which depends at least on a gap indicator Δ in a stochastic model and on a corruption indicator C indicating an adversarial corruption of the stochastic model.
In the present example embodiment, the objective function ft is, for example, a function (submodular function) satisfying submodularity. That is, the objective function ft satisfies, for an arbitrary subset X,Y satisfying X,Y⊆[n], the following inequality: ft(X∩Y)+ft(X∪Y)≤ft(X)+ft(Y). Note, however, that the present example embodiment is not limited to this.
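For illustration (this example is not part of the original disclosure), a coverage function is a textbook instance of a submodular function, and the inequality above can be checked exhaustively on a small hypothetical ground set:

```python
from itertools import chain, combinations

def coverage(X, cover):
    """f(X) = size of the union of the sets indexed by X (a classic submodular function)."""
    return len(set().union(*(cover[i] for i in X))) if X else 0

def powerset(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Hypothetical ground set [n] with n = 3; each element covers a few items.
cover = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}
ground = [1, 2, 3]

# Check f(X ∩ Y) + f(X ∪ Y) <= f(X) + f(Y) for every pair of subsets X, Y ⊆ [n].
for X in map(set, powerset(ground)):
    for Y in map(set, powerset(ground)):
        assert coverage(X & Y, cover) + coverage(X | Y, cover) <= coverage(X, cover) + coverage(Y, cover)
print("submodularity inequality verified for all subset pairs")
```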
A destination to which the output section 12 outputs information indicating the subset Xt⊆[n] does not limit the present example embodiment and, for example, a configuration may be employed in which the output section 12 includes a display panel and information indicating the subset Xt is displayed on the display panel. Alternatively, a configuration may be employed in which the output section 12 provides another apparatus with information indicating the subset Xt, and that apparatus displays the information or refers to the information to automatically update a price of a product or the like.
The gap indicator is expressed as
using an expected value
of the objective function ft in a case where the objective function ft follows a probability distribution D.
The corruption indicator C is expressed as
using the objective function ft and a time-dependent objective function ft′.
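The displayed formulas for the gap indicator Δ and the corruption indicator C are not reproduced in this text. For orientation only, the forms below are the ones commonly used in the best-of-both-worlds literature and are consistent with the surrounding description; they are assumptions here, not a reproduction of the original drawings.

```latex
\bar{f}(X) = \mathbb{E}_{f \sim \mathcal{D}}[f(X)], \qquad
\Delta = \min_{X \subseteq [n],\; X \neq X^*} \bigl\{ \bar{f}(X) - \bar{f}(X^*) \bigr\}, \qquad
C = \sum_{t \in [T]} \max_{X \subseteq [n]} \bigl| f_t(X) - f'_t(X) \bigr|
```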
In the information processing apparatus 1 configured as described above, an expected value of a regret is bounded from above by an upper limit value A(Δ,n,C). Here, the upper limit value is defined so as to encompass both a stochastic model and an adversarial model, and is expressed using a corruption indicator C indicating an adversarial corruption of the stochastic model, as described above. Therefore, according to the information processing apparatus 1 configured as described above, it is possible to suitably solve an optimization problem with respect to both a stochastic model and an adversarial model.
The following description will discuss a flow of an information processing method S1 which is carried out by the information processing apparatus 1, with reference to
In step S11, the selection section 11 selects a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1.
In step S12, the output section 12 outputs information indicating the subset Xt⊆[n] which has been selected by the selection section 11.
Here, in step S11, the selection section 11 selects the subset Xt⊆[n] so that an asymptotic behavior of an expected value of a regret Σt∈[T]ft(Xt)−Σt∈[T]ft(X*), which is expressed using an observation value ft(Xt) of an objective function in each round t∈[T] and a comparative solution X*, is bounded from above by an upper limit value A(Δ,n,C) which depends at least on a gap indicator Δ in a stochastic model and on a corruption indicator C indicating an adversarial corruption of the stochastic model. Note that the comparative solution is, for example, an optimum solution that minimizes an objective function expressing a loss, as described above.
In the information processing method S1 configured as described above, an expected value of a regret is bounded from above by an upper limit value A(Δ,n,C). Here, the upper limit value is defined so as to encompass both a stochastic model and an adversarial model, and is expressed using a corruption indicator C indicating an adversarial corruption of the stochastic model, as described above. Therefore, according to the information processing method S1 as described above, it is possible to suitably solve an optimization problem with respect to both a stochastic model and an adversarial model.
(Comparison of an Algorithm by Information Processing Method S1 with Other Algorithms)
As indicated in the first graph from the top in
The third graph from the top in
As indicated in the third graph from the top in
As such, according to the information processing apparatus 1 and the information processing method S1 described above, it is possible to suitably solve an optimization problem with respect to both a stochastic model and an adversarial model.
The following description will discuss a second example embodiment of the present invention in detail with reference to the drawings. The same reference numerals are given to constituent elements which have functions identical with those described in the first example embodiment, and descriptions as to such constituent elements are omitted as appropriate.
The following description will discuss a configuration of an information processing system 100A in accordance with the present example embodiment, with reference to
As illustrated in
The storage section 17A stores an observation value OB of an objective function which has been received from the terminal apparatus 2A (described later). In addition, the storage section 17A stores a subset SB which has been selected by the selection section 11.
The communication section 19A communicates with an apparatus external to the information processing apparatus 1A. For example, the communication section 19A communicates with the terminal apparatus 2A. The communication section 19A transmits data supplied from the control section 10A to the terminal apparatus 2A and supplies data received from the terminal apparatus 2A to the control section 10A.
As illustrated in
The acquisition section 13 acquires an observation value OB of an objective function ft in each round t∈[T] (where T is an arbitrary natural number) from the terminal apparatus 2A via the communication section 19A. The acquisition section 13 causes the storage section 17A to store the acquired observation value OB of the objective function. Here, information that the acquisition section 13 can acquire may vary in accordance with whether the setting is a full-information setting or a bandit-feedback setting.
In the full-information setting, the acquisition section 13 can acquire an observation value ft(X) of an objective function with respect to an arbitrary subset X⊆[n], the observation value ft(X) being obtained after the output section 12 has output a subset Xt in a round t.
Meanwhile, in the bandit-feedback setting, the acquisition section 13 can acquire an observation value ft(Xt) of an objective function with respect to the selected subset Xt, the observation value ft(Xt) being obtained after the output section 12 has output a subset Xt in a round t. However, in the bandit-feedback setting, the acquisition section 13 cannot acquire an observation value ft(X) of an objective function with respect to a subset X⊆[n] other than the selected subset.
Here, as with the first example embodiment, the subset Xt is a subset that defines an action in the round t. For example, the subset Xt has a meaning as a subset including identification information of a product as an element. Note, however, that the present example embodiment is not limited to this. Examples of the foregoing objective function include an objective function that defines a relation between an action and an amount of sales, and an objective function that defines a relation between an action and an amount of loss. Note, however, that the present example embodiment is not limited to these examples.
The selection section 11 selects a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1.
Here, the selection section 11 in accordance with the present example embodiment selects the subset Xt⊆[n] so that an asymptotic behavior of an expected value of a regret Σt∈[T]ft(Xt)−Σt∈[T]ft(X*), which is expressed using an observation value ft(Xt) of an objective function in each round t∈[T] and a comparative solution X*, is bounded from above by an upper limit value A(Δ,n,C) which depends at least on a gap indicator Δ in a stochastic model and on a corruption indicator C indicating an adversarial corruption of the stochastic model. Note that a comparative solution is, for example, an optimum solution that minimizes an objective function expressing a loss, as with the first example embodiment. A specific process carried out by the selection section 11 will be described later.
The output section 12 outputs information indicating the subset Xt⊆[n] which has been selected by the selection section 11. A destination to which the output section 12 outputs information indicating the subset Xt⊆[n] does not limit the present example embodiment and, for example, a configuration may be employed in which the output section 12 transmits the information to the terminal apparatus 2A via the communication section 19A. Alternatively, a configuration may be employed in which the output section 12 includes a display panel and information indicating the subset Xt is displayed on the display panel.
As illustrated in
The communication section 29A communicates with an apparatus external to the terminal apparatus 2A. For example, the communication section 29A communicates with the information processing apparatus 1A. The communication section 29A transmits data supplied from the control section 20A to the information processing apparatus 1A and supplies data received from the information processing apparatus 1A to the control section 20A.
The display section 27A displays data for display supplied from the control section 20A. For example, the display section 27A displays information indicating the subset Xt which has been selected by the selection section 11 of the information processing apparatus 1A and supplied to the terminal apparatus 2A.
The input reception section 28A receives various kinds of input to the terminal apparatus 2A. For example, the input reception section 28A receives an observation value of an objective function in each round t. Then, the input reception section 28A supplies the received observation value to the control section 20A. The supplied observation value is transmitted to the information processing apparatus 1A via the communication section 29A and is acquired by the acquisition section 13 described above. The input reception section 28A may be configured to receive the observation value via an operation by a user, or may be configured to automatically acquire the observation value.
A specific configuration of the input reception section 28A does not limit the present example embodiment and, for example, the input reception section 28A may be configured to include an input device such as a keyboard or a touch pad. Alternatively, a configuration may be employed in which the input reception section 28A includes a data scanner or the like that reads data via electromagnetic waves such as infrared rays or radio waves.
As illustrated in
The action execution section 21, in each round t, acquires information indicating the subset Xt which has been selected by the selection section 11 of the information processing apparatus 1A, and carries out an action corresponding to the acquired subset Xt. For example, the action execution section 21 updates prices of one or more products indicated by the subset Xt with reference to information indicating the subset Xt. Alternatively, it is possible to employ a configuration in which the action execution section 21 generates display data indicating the subset Xt and supplies the generated display data to the display section 27A, and the display section 27A displays the display data. In this configuration, a user updates prices of one or more products with reference to the display data displayed by the display section 27A.
The observation value acquisition section 22 acquires, via the input reception section 28A, an observation value of an objective function after the action execution section 21 has carried out an action. The observation value of the objective function acquired by the observation value acquisition section 22 is supplied to the information processing apparatus 1A via the communication section 29A and is acquired by the acquisition section 13 of the information processing apparatus 1A.
Next, the following description will discuss, with reference to
As illustrated in
In the full-information setting, in step S23(t−1), the terminal apparatus 2A can provide an observation value ft−1(X) of an objective function with respect to an arbitrary subset X⊆[n], the observation value ft−1(X) being obtained after the output section 12 has output a subset Xt−1 in a round t−1.
Meanwhile, in the bandit-feedback setting, the terminal apparatus 2A can acquire, in step S23(t−1), an observation value ft−1(Xt−1) of an objective function with respect to the selected subset Xt−1, the observation value ft−1(Xt−1) being obtained after the output section 12 has output the subset Xt−1 in the round t−1. However, in the bandit-feedback setting, the terminal apparatus 2A cannot acquire, in step S23(t−1), an observation value ft−1(X) of an objective function with respect to a subset X⊆[n] other than the selected subset.
Next, in step S13(t−1), the acquisition section 13 of the information processing apparatus 1A acquires the observation value ft−1 of the objective function which has been provided by the terminal apparatus 2A in step S23(t−1).
Next, in step S11(t), the selection section 11 of the information processing apparatus 1A selects a subset Xt⊆[n] with reference to the observation value ft−1 of the objective function which has been acquired by the acquisition section in step S13(t−1). A specific process carried out by the selection section 11 will be described later.
Next, in step S12(t), the output section 12 outputs information indicating the subset Xt⊆[n] which has been selected by the selection section 11 in step S11(t). The information indicating the outputted subset Xt⊆[n] is transmitted to the terminal apparatus 2A via the communication section 19A.
Next, in step S21(t), the action execution section 21 of the terminal apparatus 2A carries out an action corresponding to the information indicating the subset Xt⊆[n] which has been output by the output section 12 in step S12(t). A specific process carried out by the action execution section 21 has been described above, and thus a description thereof is omitted here.
Next, in step S22(t), the observation value acquisition section 22 of the terminal apparatus 2A acquires an observation value of an objective function which is obtained after the action carried out by the action execution section 21 in step S21(t).
Next, in step S23(t), the terminal apparatus 2A provides the information processing apparatus 1A with the observation value of the objective function which has been acquired in step S22(t).
Subsequently, as illustrated in
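For illustration (not part of the original disclosure), the repeated exchange of steps S11(t) to S13(t) and S21(t) to S23(t) can be summarized as a single loop; select_subset, update_state, and environment below are hypothetical stand-ins for the selection section 11, the internal state update of the information processing apparatus 1A, and the terminal apparatus 2A side, respectively.

```python
def run_rounds(T, select_subset, update_state, environment):
    """Illustrative skeleton of the repeated exchange of steps S11(t)-S13(t)
    (information processing apparatus 1A side) and S21(t)-S23(t) (terminal
    apparatus 2A side). The callables are hypothetical stand-ins.
    """
    observation = None  # no feedback exists before the first round
    for t in range(1, T + 1):
        X_t = select_subset(t, observation)  # S11(t): select X_t with reference to round t-1 feedback
        # S12(t): output X_t; S21(t)-S22(t): the terminal side acts and observes
        observation = environment(t, X_t)    # S23(t)/S13(t): the observation value is fed back
        update_state(t, X_t, observation)    # the apparatus updates its internal state for round t+1
```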
Next, the following description will discuss, with reference to
First, in step S101, the selection section 11 initializes various parameters to be used in processing. For example, the selection section 11 initializes a cumulative subgradient G1 in a first round by G1i=0∈Rn. Here, i is an index satisfying i∈[n], and [n] is a set of natural numbers [n]={1, 2, . . . , n} (where n is an arbitrary natural number).
Step S102 is a starting end of a loop process that is expressed by a loop variable t (t=1, 2, . . . , T) (where T is an arbitrary natural number). Here, the loop variable t is an index indicating a round number.
In step S111A, the selection section 11 calculates a vector xt∈[0,1]n by the following formula.
Here, λti is a parameter indicating a learning rate, and is defined by the following formula.
Here, h is defined, for all z∈[0,1] and g∈R, by the following formula.
Next, in step S112A, the selection section 11 calculates, for all i∈[n−1], a permutation σt:[n]→[n] where xtσ(i)≤xtσ(i+1).
Next, in step S113A, the selection section 11 decides a value of a random variable ut which is uniformly distributed on [0,1]. In other words, the selection section 11 decides the value of the variable ut in accordance with a uniform probability distribution on [0,1].
Next, in step S114A, the selection section 11 decides a subset Xt so that Xt={i∈[n]|xti≥ut} is satisfied.
Next, in step S12, the output section 12 outputs the subset Xt which has been selected by the selection section 11 in step S114A. The output subset Xt is, for example, supplied to the terminal apparatus 2A, and an action corresponding to the subset Xt is carried out in an environment on the terminal apparatus 2A side.
Next, in step S13, the acquisition section 13 acquires an observation value ft(X) of an objective function ft. In this step, the acquisition section 13 can acquire an observation value ft(X) of an objective function with respect to an arbitrary subset X⊆[n], the observation value ft(X) being obtained after the output section 12 has output a subset Xt in step S12.
Next, in step S115A, the selection section 11 calculates a subgradient gt∈Rn by the following formula.
Here, ρi(σt) is defined by the following formula.
χi∈{0,1}n represents the indicator vector of i; that is, χij=1 if and only if i=j.
Next, in step S116A, the selection section 11 updates the cumulative subgradient Gt by Gt+1=Gt+gt.
Step S103 is a terminus end of the loop process expressed by the loop variable t.
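For illustration, the following Python sketch traces the loop of the algorithm 1. The displayed formulas for xt, the learning rate λti, the function h, and ρi(σt) are not reproduced in this text, so the sketch substitutes assumed stand-ins: a sigmoid map (consistent with an entropic regularizer; see the FTRL discussion below), AdaGrad-style learning rates, and the standard marginal-difference subgradient of the Lovász extension. It also sorts in descending order so that the threshold rule of step S114A yields prefix sets of the permutation; it is a sketch under these assumptions, not the algorithm of the disclosure.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def algorithm1(f, n, T):
    """Structural sketch of processing example 1 (full-information setting).

    f(t, X) returns the observation value f_t(X) for any subset X (frozenset);
    full information permits querying subsets other than the selected X_t.
    The sigmoid map, the AdaGrad-style learning rate, and the subgradient
    formula below are assumed stand-ins for the elided formulas.
    """
    G = [0.0] * n                                         # step S101: G_1 = 0 in R^n
    sq = [1.0] * n                                        # running sums for the assumed learning rate
    for t in range(1, T + 1):
        lam = [math.sqrt(s) for s in sq]                  # assumed lambda_{t,i}
        x = [sigmoid(-G[i] / lam[i]) for i in range(n)]   # step S111A (assumed closed form)
        sigma = sorted(range(n), key=lambda i: -x[i])     # step S112A (descending, so that
                                                          # thresholding yields prefix sets)
        u = random.random()                               # step S113A: u_t ~ Unif([0,1])
        X_t = frozenset(i for i in range(n) if x[i] >= u) # step S114A
        _ = f(t, X_t)                                     # steps S12/S13: output and observe
        # step S115A: subgradient of the Lovasz extension,
        # g_i = f(prefix including i) - f(prefix excluding i)
        g = [0.0] * n
        prefix, prev = [], f(t, frozenset())
        for i in sigma:
            prefix.append(i)
            cur = f(t, frozenset(prefix))
            g[i] = cur - prev
            prev = cur
        for i in range(n):                                # step S116A: G_{t+1} = G_t + g_t
            G[i] += g[i]
            sq[i] += g[i] * g[i]
    return G
```

With these stand-ins the sketch is ordinary adaptive FTRL on the Lovász extension; the precise formulas of the disclosure are what yield the bound of theorem 1.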
The inventors of the present application have succeeded in proving that a regret
obtained by the above-described processing example 1 (algorithm 1) is bounded from above by
(theorem 1). The inventors of the present application have also shown that, as a corollary of theorem 1, the regret RT is bounded from above as follows.
In other words, it has been indicated that the regret RT is bounded from above by an upper limit value A(Δ,n,C) which depends on Δ, n, and C.
Here, the parameter Δ≥0 represents a suboptimality gap in a stochastic model. More specifically, the parameter Δ is expressed as
using an expected value
of an objective function ft in a case where the objective function ft follows an unknown distribution D.
The parameter C represents a corruption indicator, and the corruption indicator C is expressed as
using the objective function ft and a time-dependent objective function ft′. Here, the time-dependent objective function ft′ is selected from the unknown distribution D which is independent of a round.
In the information processing apparatus 1A configured as described above, in the full-information setting, an expected value of a regret is bounded from above by an upper limit value A(Δ,n,C) which depends on Δ, n, and C, by carrying out the algorithm 1. Therefore, according to the information processing apparatus 1A configured as described above, in the full-information setting, it is possible to suitably solve an optimization problem with respect to both a stochastic model and an adversarial model.
Next, the following description will discuss, with reference to
First, in step S101, the selection section 11 initializes various parameters to be used in processing. For example, the selection section 11 initializes a cumulative subgradient Ĝ1 in a first round by Ĝ1i=0∈Rn. Here, i is an index satisfying i∈[n], and [n] is a set of natural numbers [n]={1, 2, . . . , n} (where n is an arbitrary natural number). Moreover, Ĝti denotes “Gti” with a hat.
Step S102 is a starting end of a loop process that is expressed by a loop variable t (t=1, 2, . . . , T) (where T is an arbitrary natural number). Here, the loop variable t is an index indicating a round number.
In step S111B, the selection section 11 calculates, for each i∈[n], an n-dimensional vector xt∈[0,1]n by xti=ζ(Ĝti/λt) using a learning rate λt, a cumulative subgradient Ĝti, and a function ζ below.
Here, the learning rate λt is defined by the following formula.
Next, in step S112B, the selection section 11 calculates, for all i∈[n−1], a permutation σt:[n]→[n] where xtσ(i)≤xtσ(i+1).
In step S113B, the selection section 11 selects an index it∈{0, 1, . . . , n} in accordance with a probability below.
In step S114B, the selection section 11 decides a subset Xt so that Xt=σt([it])={σt(j)|j∈[it]} is satisfied.
Next, in step S12, the output section 12 outputs the subset Xt which has been selected by the selection section 11 in step S114B. The output subset Xt is, for example, supplied to the terminal apparatus 2A, and an action corresponding to the subset Xt is carried out in an environment on the terminal apparatus 2A side.
Next, in step S13, the acquisition section 13 acquires an observation value ft(X) of an objective function ft. In this step, the acquisition section 13 can acquire an observation value ft(Xt) of an objective function with respect to the selected subset Xt, the observation value ft(Xt) being obtained after the output section 12 has output a subset Xt in step S12. However, in this step, the acquisition section 13 cannot acquire an observation value ft(X) of an objective function with respect to a subset X⊆[n] other than the selected subset.
Next, in step S115B, the selection section 11 calculates a subgradient ĝt∈Rn by the following formula.
Here, ĝt denotes “gt” with a hat. Moreover, ρi(σt) is defined as described above.
Next, in step S116B, the selection section 11 updates the cumulative subgradient Ĝt by Ĝt+1=Ĝt+ĝt.
Step S103 is a terminus end of the loop process expressed by the loop variable t.
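Analogously, the following Python sketch traces the loop of the algorithm 2 in the bandit-feedback setting, where ft may be evaluated only at the selected subset Xt. The formulas for ζ, λt, the index probabilities, and ĝt are not reproduced in this text; the sketch assumes a sigmoid stand-in for ζ, a √(nt)-type learning rate, index probabilities given by the gaps of the sorted xt (which sum to one under boundary values 1 and 0), and a one-point importance-weighted estimate whose expectation equals the marginal-difference subgradient used in the sketch of the algorithm 1.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def algorithm2(f, n, T):
    """Structural sketch of processing example 2 (bandit-feedback setting).

    f(t, X) may be called only once per round, on the selected subset X_t.
    The sigmoid stand-in for zeta, the sqrt(n*t) learning rate, the index
    distribution, and the one-point subgradient estimate are assumptions,
    not the elided formulas of the disclosure.
    """
    G_hat = [0.0] * n                                       # step S101: ^G_1 = 0 in R^n
    for t in range(1, T + 1):
        lam = math.sqrt(n * t)                              # assumed learning rate lambda_t
        x = [sigmoid(-G_hat[i] / lam) for i in range(n)]    # step S111B (assumed zeta)
        pi = sorted(range(n), key=lambda i: -x[i])          # step S112B (descending, so that
                                                            # prefix sets are threshold sets)
        vals = [1.0] + [x[i] for i in pi] + [0.0]
        p = [vals[i] - vals[i + 1] for i in range(n + 1)]   # P(i_t = i), i = 0..n; sums to one
        i_t = random.choices(range(n + 1), weights=p)[0]    # step S113B
        X_t = frozenset(pi[:i_t])                           # step S114B: X_t = sigma_t([i_t])
        f_val = f(t, X_t)                                   # steps S12/S13: the only observation
        # step S115B: one-point unbiased estimate;
        # E[^g_{pi(j)}] = f_t(pi[1..j]) - f_t(pi[1..j-1]), at most two nonzero coordinates
        g_hat = [0.0] * n
        if i_t >= 1:
            g_hat[pi[i_t - 1]] += f_val / p[i_t]
        if i_t < n:
            g_hat[pi[i_t]] -= f_val / p[i_t]
        for i in range(n):                                  # step S116B: ^G_{t+1} = ^G_t + ^g_t
            G_hat[i] += g_hat[i]
    return G_hat
```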
The inventors of the present application have succeeded in proving that a regret
obtained by the above-described processing example 2 (algorithm 2) is bounded from above by
(theorem 2). The inventors of the present application have also shown that, as a corollary of theorem 2, the regret RT is bounded from above as follows.
In other words, it has been indicated that the regret RT is bounded from above by an upper limit value A(Δ,n,C) which depends on Δ, n, and C. Here, the parameter Δ represents a suboptimality gap as with the processing example 1, and the parameter C represents a corruption indicator as with the processing example 1.
In the information processing apparatus 1A configured as described above, in the bandit-feedback setting, an expected value of a regret is bounded from above by an upper limit value A(Δ,n,C) which depends on Δ, n, and C, by carrying out the algorithm 2. Therefore, according to the information processing apparatus 1A configured as described above, in the bandit-feedback setting, it is possible to suitably solve an optimization problem with respect to both a stochastic model and an adversarial model.
(Relation with Lovász Extension)
The following description will discuss a relation between the above-described algorithm 1 and algorithm 2 and the Lovász extension.
When a function f:2[n]→R is given, a Lovász extension ˜f:[0,1]n→R of the function f is given as below. Here, “˜f” represents “f” with a tilde.
First, it is assumed that a set of indices i in which xi≥u for x=(x1, x2, . . . , xn)T∈[0,1]n and u∈[0,1] is expressed as Hu(x). That is, Hu(x) is defined by Hu(x)={i∈[n]|xi≥u}. Using this Hu(x), the Lovasz extension ˜f(x) is defined by the following formula.
Here, Unif([0,1]) represents a uniform distribution on [0,1]. The Lovász extension ˜f(x) is known to be a convex function if and only if the function f is submodular.
From the above definition, for an arbitrary permutation σ:[n]→[n] satisfying xσ(i)≤xσ(i+1) for arbitrary x∈[0,1]n and arbitrary i∈[n−1], the Lovász extension ˜f(x) is expressed as below.
Here, σ[i]={σ(j)|j∈[i]}. As boundary conventions, it is defined that xσ(0)=0 and xσ(n+1)=1.
Thus, a subgradient g(σ)∈Rn of the Lovász extension ˜f(x) is defined as below.
Here, ρi(σ) is as described in the algorithm 1 and algorithm 2.
The subgradient of the Lovász extension ˜f(x) is used in both the algorithm 1 and the algorithm 2, as described above.
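For illustration (not part of the original disclosure), the Lovász extension can be evaluated exactly by integrating f(Hu(x)) over the breakpoints of u, and the closed form can be checked against Monte Carlo sampling of u; the coverage function below is a hypothetical stand-in for f.

```python
import random

def lovasz_extension(f, x):
    """Exact evaluation of ~f(x) = E_{u~Unif([0,1])} f(H_u(x)), using the fact
    that H_u(x) = {i : x_i >= u} is piecewise constant between breakpoints of u."""
    breakpoints = sorted(set([0.0, 1.0] + list(x)))
    total, a = 0.0, 0.0
    for b in breakpoints:
        if b <= a:
            continue
        level_set = frozenset(i for i, xi in enumerate(x) if xi >= b)  # H_u(x) for u in (a, b]
        total += (b - a) * f(level_set)
        a = b
    return total

def f(X):
    # Hypothetical coverage function standing in for a submodular f.
    sets = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d", "e"}}
    return float(len(set().union(*(sets[i] for i in X)))) if X else 0.0

x = [0.3, 0.8, 0.5]
monte_carlo = sum(
    f(frozenset(i for i, xi in enumerate(x) if xi >= random.random()))
    for _ in range(100_000)
) / 100_000
print(lovasz_extension(f, x), "~", monte_carlo)  # both approximately 2.9
```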
(Relation with FTRL Algorithm)
The following description will discuss a relation between the above-described algorithm 1 and algorithm 2 and a follow the regularized leader (FTRL) algorithm.
The FTRL algorithm is a general and widely used approach in online convex optimization on a subset Ω of Rn. An update rule in the FTRL algorithm is expressed as below.
Here, gt is a subgradient of an objective function ft in xt, and ψt is a regularizer which is a convex function on Ω. In the FTRL algorithm, it can be shown that, for xt∈Ω and arbitrary x*∈Ω, the regret is bounded as follows.
Here, Dt is the Bregman divergence associated with ψt.
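The displayed update rule and regret bound are not reproduced in this text. For orientation, the standard FTRL statements consistent with the description above (with Dt the Bregman divergence of ψt) read as follows; the exact expressions of the disclosure are in the original drawings.

```latex
x_{t+1} \in \operatorname*{arg\,min}_{x \in \Omega}
  \Bigl\{ \textstyle\sum_{s=1}^{t} \langle g_s, x \rangle + \psi_{t+1}(x) \Bigr\},
\qquad
\sum_{t=1}^{T} \langle g_t, x_t - x^* \rangle
  \le \psi_{T+1}(x^*) - \min_{x \in \Omega} \psi_1(x)
    + \sum_{t=1}^{T} \bigl( \langle g_t, x_t - x_{t+1} \rangle - D_t(x_{t+1}, x_t) \bigr)
```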
The above-described algorithm 1 corresponds to the fact that, in the FTRL algorithm, the regularizer ψt is defined as follows.
Here, λti is the learning rate described above. In the FTRL algorithm that has a regularizer defined as above, it can be shown that xti and the cumulative subgradient Gt are expressed as below.
The above xti and cumulative subgradient Gt are used in the foregoing algorithm 1.
Meanwhile, the algorithm 2 corresponds to the fact that, in the FTRL algorithm, the regularizer ψt is defined as follows.
In the FTRL algorithm that has a regularizer defined as above, it can be shown that xti is expressed as follows.
Here, the function ζ is as described in the algorithm 2. The above xti and the cumulative subgradient Ĝt are used in the foregoing algorithm 2.
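For concreteness, one regularizer that matches the structure described above, and the sigmoid stand-in used in the sketches of the algorithm 1 and the algorithm 2, is the coordinate-wise entropic regularizer; this is an assumption for illustration, not a reproduction of the elided formulas.

```latex
\psi_t(x) = \sum_{i \in [n]} \lambda_{t,i}
  \bigl( x_i \log x_i + (1 - x_i) \log (1 - x_i) \bigr)
\;\Longrightarrow\;
x_{t,i} = \operatorname*{arg\,min}_{z \in [0,1]}
  \bigl\{ G_{t,i}\, z + \lambda_{t,i} \bigl( z \log z + (1 - z) \log(1 - z) \bigr) \bigr\}
  = \frac{1}{1 + \exp(G_{t,i} / \lambda_{t,i})}
```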
Next, a display example by the information processing system 100A will be described with reference to
Then, as illustrated in
The information processing system 100A can present the amount of sales and the prices of the products to a user by carrying out such display.
The foregoing information processing apparatuses 1 and 1A can be applied to various problems. Examples thereof are given below.
An action is assumed to be selection of a path from one point to another point. For example, it is assumed that there are n−1 relay points from the one point to the other point and there are m selectable paths in each section. In such a condition, an action measure (selected subset) Xt of [0, 2, 1, . . . ] indicates that a path 0 is selected in a first section, a path 2 is selected in a second section, and a path 1 is selected in a third section.
An objective function ft receives the action measure Xt as input and outputs a time taken to pass through the paths indicated by the action measure. In this case, by applying the above-described optimization method, it is possible to derive an optimum path setting for reaching the other point from the one point in as short a time as possible.
An action is assumed to be price adjustment of various kinds of beer of respective companies in a certain store. For example, it is assumed that, in a case where an action measure (selected subset) Xt is [0, 2, 1, . . . ], a first element indicates that a beer price of a company A is a list price, a second element indicates that a beer price of a company B is increased by 10% from a list price, and a third element indicates that a beer price of a company C is reduced by 10% from a list price.
An objective function ft receives the action measure Xt as input and outputs a result of sales made while applying the action measure Xt to the beer prices of the respective companies. In this case, optimum price setting of beer prices of the respective companies in the above store can be derived by applying the above-described optimization method.
The following description will discuss a case in which the method is applied to an investment action by an investor, or the like. In this case, an action measure Xt is assumed to be investment (purchase, capital increase), sell-off, or holding of a plurality of financial products (e.g., stock brands) which the investor is holding or intends to hold. For example, it is assumed that, in a case where an action measure (selected subset) Xt is [1, 0, 2, . . . ], a first element indicates an additional investment in stocks of a company A, a second element indicates holding (neither purchase nor sell-off) a credit of a company B, and a third element indicates sell-off of stocks of a company C.
An objective function ft receives the action measure Xt as input and outputs a result of applying the action measure Xt to the investment action with respect to financial products of the respective companies. In this case, optimum investment actions with respect to the respective brands by the investor can be derived by applying the above-described optimization method.
The following description will discuss a case where the method is applied to a dosing action for a clinical trial of a certain drug in a pharmaceutical company. In this case, an action measure Xt is assumed to be a dosage of drug or avoidance of dosing. For example, it is assumed that, in a case where an action measure (selected subset) Xt is [1, 0, 2, . . . ], a first element indicates that dosing in a dosage 1 is carried out with respect to a subject A, a second element indicates that dosing is not carried out with respect to a subject B, and a third element indicates that dosing in a dosage 2 is carried out with respect to a subject C.
An objective function ft receives the action measure Xt as input and outputs a result of applying the action measure Xt to the dosing actions with respect to the respective subjects. In this case, optimum dosing actions with respect to the respective subjects in the clinical trial by the pharmaceutical company can be derived by applying the above-described optimization method.
The following description will discuss a case in which the method is applied to an advertising action (marketing measure) in an operating company of a certain electronic commerce site. In this case, an action measure Xt is assumed to be advertisement (an online (banner) advertisement, advertisement by e-mail, direct mail, transmission of a discount coupon by e-mail, or the like) with respect to a plurality of customers for a product or service which the operating company intends to sell. For example, it is assumed that, in a case where an action measure (selected subset) Xt is [1, 0, 2, . . . ], a first element indicates a banner advertisement for a customer A, a second element indicates that no advertisement is given to a customer B, and a third element indicates transmission of a discount coupon to a customer C by e-mail.
An objective function ft receives the action measure Xt as input and outputs a result of applying the action measure Xt to the advertising actions with respect to the respective customers. Here, execution results may include whether or not the banner advertisement has been clicked, a purchase amount, a purchase probability, or an expected value of a purchase amount. In this case, optimum advertising actions with respect to the respective customers by the operating company can be derived by applying the optimization method in accordance with the present example embodiment.
Some or all of the functions of each of the information processing apparatuses 1 and 1A and the terminal apparatus 2A may be implemented by hardware such as an integrated circuit (IC chip), or may be implemented by software.
In the latter case, each of the information processing apparatuses 1 and 1A and the terminal apparatus 2A is realized by, for example, a computer that executes instructions of a program that is software realizing the foregoing functions.
Examples of the processor C1 include a central processing unit (CPU), a graphic processing unit (GPU), a digital signal processor (DSP), a micro processing unit (MPU), a floating point number processing unit (FPU), a physics processing unit (PPU), a microcontroller, and a combination thereof. Examples of the memory C2 include a flash memory, a hard disk drive (HDD), a solid state drive (SSD), and a combination thereof.
Note that the computer C can further include a random access memory (RAM) in which the program P is loaded when the program P is executed and in which various kinds of data are temporarily stored. The computer C can further include a communication interface for carrying out transmission and reception of data with other apparatuses. The computer C can further include an input-output interface for connecting input-output apparatuses such as a keyboard, a mouse, a display and a printer.
The program P can be stored in a computer C-readable, non-transitory, and tangible storage medium M. The storage medium M can be, for example, a tape, a disk, a card, a semiconductor memory, a programmable logic circuit, or the like. The computer C can obtain the program P via the storage medium M. The program P can be transmitted via a transmission medium. The transmission medium can be, for example, a communication network, a broadcast wave, or the like. The computer C can obtain the program P also via such a transmission medium.
The present invention is not limited to the foregoing example embodiments, but may be altered in various ways by a skilled person within the scope of the claims. For example, the present invention also encompasses, in its technical scope, any example embodiment derived by appropriately combining technical means disclosed in the foregoing example embodiments.
Some or all of the foregoing example embodiments can also be described as below. Note, however, that the present invention is not limited to the following supplementary notes.
An information processing apparatus, including: a selection means for selecting a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1; and an output means for outputting information indicating the subset Xt⊆[n] which has been selected by the selection means, the selection means selecting the subset Xt⊆[n] so that an asymptotic behavior of an expected value of a regret Σt∈[T]ft(Xt)−Σt∈[T]ft(X*), which is expressed using an observation value ft(Xt) of an objective function in each round t∈[T] and a comparative solution X*, is bounded from above by an upper limit value A(Δ,n,C) which depends at least on a gap indicator Δ in a stochastic model and on a corruption indicator C indicating an adversarial corruption of the stochastic model.
The information processing apparatus described in supplementary note 1, in which: the gap indicator is expressed as
using an expected value
of the objective function ft in a case where the objective function ft follows a probability distribution D; and the corruption indicator C is expressed as
using the objective function ft and a time-dependent objective function ft′.
The information processing apparatus described in supplementary note 1 or 2, further including: an acquisition means for acquiring an observation value ft(X) of the objective function with respect to an arbitrary subset X⊆[n] after the output means has output the subset Xt in the round t, the selection means being capable of referring to the observation value ft(X) which has been acquired by the acquisition means, and the upper limit value A(Δ,n,C) being expressed as
The information processing apparatus described in supplementary note 3, in which, the selection means carries out, in each round: a vector calculation step of calculating, for each i∈[n], an n-dimensional vector xt∈[0,1]n by
using a learning rate λti and a cumulative subgradient Gti; a permutation calculation step of calculating, for all i∈[n−1], a permutation σt:[n]→[n] where xtσ(i)≤xtσ(i+1); a random variable decision step of deciding values of a random variable ut which are uniformly distributed on [0,1]; a subset decision step of deciding a subset Xt so that Xt={i∈[n]|xti≥ut} is satisfied; an acquisition step of acquiring an observation value ft(X) of an objective function; a subgradient calculation step of calculating a subgradient gt∈Rn of an objective function ft; and an updating step of updating a cumulative subgradient Gt by Gt+1=Gt+gt.
The information processing apparatus described in supplementary note 1 or 2, further including: an acquisition means for acquiring an observation value ft(Xt) of the objective function with respect to a selected subset Xt after the output means has output the selected subset Xt in a round t, the selection means being capable of referring to an observation value ft(Xt) of the objective function with respect to the selected subset Xt, and being incapable of referring to an observation value ft(X) of the objective function with respect to a subset X⊆[n] other than the selected subset, and the upper limit value A(Δ,n,C) being expressed as
The information processing apparatus described in supplementary note 5, in which, the selection means carries out, in each round: a vector calculation step of calculating, for each i∈[n], an n-dimensional vector xt∈[0,1]n by xti=ζ(Ĝti/λt) using a learning rate λt, a cumulative subgradient Ĝti, and a function ζ below,
a permutation calculation step of calculating, for all i∈[n−1], a permutation σt:[n]→[n] where xtσ(i)≤xtσ(i+1); an index selection step of selecting an index it∈{0, 1, . . . , n} in accordance with a probability below,
a subset decision step of deciding a subset Xt so that Xt={σt(j)|j∈[it]} is satisfied; a step of acquiring an observation value ft(Xt) of an objective function; a subgradient calculation step of calculating a subgradient ĝt∈Rn of an objective function ft; and an updating step of updating a cumulative subgradient Ĝt by Ĝt+1=Ĝt+ĝt.
An information processing apparatus, including: a selection means for selecting a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1; and an output means for outputting information indicating the subset Xt⊆[n] which has been selected by the selection means, the selection means carrying out, in each round, a vector calculation step of calculating, for each i∈[n], an n-dimensional vector xt∈[0,1]n by
using a learning rate λti and a cumulative subgradient Gti, a permutation calculation step of calculating, for all i∈[n−1], a permutation σt:[n]→[n] where xtσ(i)≤xtσ(i+1), a random variable decision step of deciding values of a random variable ut which are uniformly distributed on [0,1], a subset decision step of deciding a subset Xt so that Xt={i∈[n]|xti≥ut} is satisfied, an acquisition step of acquiring a value of an objective function ft(X), a subgradient calculation step of calculating a subgradient gt∈Rn of an objective function ft, and an updating step of updating a cumulative subgradient Gt by Gt+1=Gt+gt.
An information processing apparatus, including: a selection means for selecting a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1; and an output means for outputting information indicating the subset Xt⊆[n] which has been selected by the selection means, the selection means carrying out, in each round, a vector calculation step of calculating, for each i∈[n], an n-dimensional vector xt∈[0,1]n by xti=ζ(Ĝti/λt) using a learning rate λt, a cumulative subgradient Ĝti, and a function ζ below,
a permutation calculation step of calculating, for all i∈[n−1], a permutation σt:[n]→[n] where xtσ(i)≤xtσ(i+1), an index selection step of selecting an index it∈{0, 1, . . . , n} in accordance with a probability below,
a subset decision step of deciding a subset Xt so that Xt={σt(j)|j∈[it]} is satisfied, a step of acquiring a value of an objective function ft(Xt), a subgradient calculation step of calculating a subgradient ĝt∈Rn of an objective function ft, and an updating step of updating a cumulative subgradient Ĝt by Ĝt+1=Ĝt+ĝt.
An information processing method, including: selecting a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1; and outputting information indicating the subset Xt⊆[n] which has been selected, in the selecting, the subset Xt⊆[n] being selected so that an asymptotic behavior of an expected value of a regret Σt∈[T]ft(Xt)−Σt∈[T]ft(X*), which is expressed using an observation value ft(Xt) of an objective function in each round t∈[T] and an optimum solution X*, is bounded from above by an upper limit value A(Δ,n,C) which depends at least on a gap indicator Δ in a stochastic model and on a corruption indicator C indicating an adversarial corruption of the stochastic model.
An information processing program for causing a computer to carry out: a process of selecting a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1; and a process of outputting information indicating the subset Xt⊆[n] which has been selected, in the process of selecting, the subset Xt⊆[n] being selected so that an asymptotic behavior of an expected value of a regret Σt∈[T]ft(Xt)−Σt∈[T]ft(X*), which is expressed using an observation value ft(Xt) of an objective function in each round t∈[T] and an optimum solution X*, is bounded from above by an upper limit value A(Δ,n,C) which depends at least on a gap indicator Δ in a stochastic model and on a corruption indicator C indicating an adversarial corruption of the stochastic model.
A computer-readable storage medium storing a program described in supplementary note 10.
An information processing apparatus including at least one processor, the at least one processor carrying out: a process of selecting a subset Xt⊆[n] of a set [n]={1, 2, . . . , n} (where n is an arbitrary natural number) in a certain round t∈[T] (where T is an arbitrary natural number) with reference to an observation value of an objective function in a round t−1; and a process of outputting information indicating the subset Xt⊆[n] which has been selected, in the process of selecting, the subset Xt⊆[n] being selected so that an asymptotic behavior of an expected value of a regret Σt∈[T]ft(Xt)−Σt∈[T]ft(X*), which is expressed using an observation value ft(Xt) of an objective function in each round t∈[T] and an optimum solution X*, is bounded from above by an upper limit value A(Δ,n,C) which depends at least on a gap indicator Δ in a stochastic model and on a corruption indicator C indicating an adversarial corruption of the stochastic model.
Note that the information processing apparatus can further include a memory. The memory can store a program for causing the at least one processor to carry out the selecting process and the outputting process. The program can be stored in a computer-readable non-transitory tangible storage medium.
This application is a National Stage Entry of PCT/JP2021/036579 filed on Oct. 4, 2021, the contents of all of which are incorporated herein by reference, in their entirety.