Optimization apparatus, optimization method, and non-transitory computer-readable medium in which optimization program is stored

Information

  • Patent Grant
  • 11949809
  • Patent Number
    11,949,809
  • Date Filed
    Monday, October 7, 2019
  • Date Issued
    Tuesday, April 2, 2024
Abstract
An optimization apparatus (100) includes a setting unit (110) that sets a predetermined non-linear objective function, a policy determination unit (120) that determines a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function, a policy execution unit (130) that acquires a reward as an execution result of the determined policy, an update rate determination unit (140) that determines an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function, and an update unit (150) that updates the non-linear objective function, based on the update rate.
Description

This application is a National Stage Entry of PCT/JP2019/039519 filed on Oct. 7, 2019, the contents of all of which are incorporated herein by reference, in their entirety.


TECHNICAL FIELD

The present invention relates to an optimization apparatus, an optimization method, and an optimization program, and more particularly, to an optimization apparatus, an optimization method, and an optimization program that perform online optimization in a bandit problem.


BACKGROUND ART

Online optimization techniques are known in the field of decision making, such as policy determination in marketing. In online optimization, an optimal policy is determined under a setting in which a value of an objective function is acquired each time a certain policy is executed. In practice, however, there are cases in which values of the objective function are acquired only partially (a bandit problem). Specifically, when a certain policy A is executed, the value of the objective function associated with the policy A (e.g., a reward acquired by executing the policy A) can be acquired, but the value of the objective function that would be acquired by executing a policy B remains unknown. Techniques of online optimization are known for the case where values of the objective function can be acquired only partially and the objective function is linear. Furthermore, Non Patent Literature 1 discloses a technique related to online optimization of a policy for a non-linear function.


Non Patent Literature 2 discloses a theory of linear and integer programming. Non Patent Literature 3 discloses a technique related to bandit convex optimization. Non Patent Literature 4 discloses a technique related to the geometry of logarithmic concave functions and sampling algorithms. Non Patent Literature 5 discloses a statistical study on logarithmic concavities and strong logarithmic concavities. Non Patent Literature 6 discloses a technique related to a multiplicative weights update method. Non Patent Literature 7 discloses a technique related to optimization of an approximately convex function.


CITATION LIST
Non Patent Literature



  • [Non Patent Literature 1] A. D. Flaxman, A. T. Kalai, and H. B. McMahan, “Online convex optimization in the bandit setting: gradient descent without a gradient” [online], 30 Nov. 2004, [Search on Oct. 2, 2019], Internet <URL: http://www.cs.cmu.edu/-mcmahan/soda2005.pdf>

  • [Non Patent Literature 2] A. Schrijver, Theory of linear and integer programming, John Wiley & Sons, 1998.

  • [Non Patent Literature 3] E. Hazan and K. Levy, Bandit convex optimization: Towards tight bounds, In Advances in Neural Information Processing Systems, pages 784-792, 2014.

  • [Non Patent Literature 4] L. Lovasz and S. Vempala, The geometry of logconcave functions and sampling algorithms, [online], March 2005, [Search on Oct. 2, 2019], Internet <URL: https://www.cc.gatech.edu/-vempala/papers/logcon.pdf>

  • [Non Patent Literature 5] A. Saumard and J. A. Wellner, Log-concavity and strong log-concavity: a review, Statistics surveys, 8:45, 2014.

  • [Non Patent Literature 6] S. Arora, E. Hazan, and S. Kale, The multiplicative weights update method: a meta-algorithm and applications, Theory of Computing, 8(1):121-164, 2012.

  • [Non Patent Literature 7] A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin, Escaping the local minima via simulated annealing: Optimization of approximately convex functions, In Conference on Learning Theory, pages 240-265, 2015.



SUMMARY OF INVENTION
Technical Problem

However, the technique according to Non Patent Literature 1 has insufficient precision in online optimization in a case where values of an objective function are only partially acquired for the non-linear objective function.


The present disclosure has been made in order to solve such a problem, and an object thereof is to provide an optimization apparatus, an optimization method, and an optimization program for achieving high-precision optimization in online optimization in a case where values of an objective function are only partially acquired for a non-linear objective function.


Solution to Problem

An optimization apparatus according to a first aspect of the present disclosure includes:

    • a setting unit that sets a predetermined non-linear objective function;
    • a policy determination unit that determines a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;
    • a policy execution unit that acquires a reward as an execution result of the determined policy;
    • an update rate determination unit that determines an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and
    • an update unit that updates the non-linear objective function, based on the update rate.


An optimization method according to a second aspect of the present disclosure includes,

    • by a computer:
    • setting a predetermined non-linear objective function;
    • determining a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;
    • acquiring a reward as an execution result of the determined policy;
    • determining an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and
    • updating the non-linear objective function, based on the update rate.


An optimization program according to a third aspect of the present disclosure causes a computer to execute:

    • setting processing of setting a predetermined non-linear objective function;
    • policy determination processing of determining a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;
    • policy execution processing of acquiring a reward as an execution result of the determined policy;
    • update rate determination processing of determining an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and
    • update processing of updating the non-linear objective function, based on the update rate.


Advantageous Effects of Invention

According to the present invention, it is possible to provide an optimization apparatus, an optimization method, and an optimization program for achieving high-precision optimization in online optimization in a case where values of an objective function are only partially acquired for a non-linear objective function.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram illustrating a configuration of an optimization apparatus according to a first example embodiment.



FIG. 2 is a flowchart illustrating a flow of an optimization method according to the first example embodiment.



FIG. 3 is a block diagram illustrating a configuration of an optimization apparatus according to a second example embodiment.



FIG. 4 is a flowchart illustrating a flow of an optimization method according to the second example embodiment.





DESCRIPTION OF EMBODIMENTS

Hereinafter, example embodiments of the present disclosure will be described in detail with reference to the drawings. In the drawings, the same or associated elements are denoted by the same reference numerals, and duplicate descriptions are omitted as necessary for clarity of description.


First Example Embodiment


FIG. 1 is a block diagram illustrating a configuration of an optimization apparatus 100 according to the first example embodiment. The optimization apparatus 100 is an information processing apparatus that performs online optimization in a bandit problem. Herein, the bandit problem is a problem setting in which the content of the objective function changes every time a solution (policy) is executed, and only the value (reward) of the objective function at the selected solution can be observed. Thus, online optimization in the bandit problem is online optimization in which values of the objective function are only partially acquired.


The optimization apparatus 100 includes a setting unit 110, a policy determination unit 120, a policy execution unit 130, an update rate determination unit 140, and an update unit 150. The setting unit 110 sets a predetermined non-linear objective function. The policy determination unit 120 determines a policy to be executed in the online optimization in the bandit problem, based on the non-linear objective function. The policy execution unit 130 acquires a reward as an execution result of the determined policy. The update rate determination unit 140 determines an update rate of a non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function. Herein, the multiplicative weight update method is, for example, a method disclosed in Non Patent Literature 6. The update unit 150 updates the non-linear objective function, based on the update rate.



FIG. 2 is a flowchart illustrating a flow of the optimization method according to the first example embodiment. First, the setting unit 110 sets a predetermined non-linear objective function (S1). Next, the policy determination unit 120 determines a policy to be executed in the online optimization in the bandit problem, based on the non-linear objective function (S2). Then, the policy execution unit 130 acquires a reward as an execution result of the determined policy (S3). Subsequently, the update rate determination unit 140 determines an update rate of the non-linear objective function by the multiplicative weight update method, based on the acquired reward and the non-linear objective function (S4). Thereafter, the update unit 150 updates the non-linear objective function, based on the update rate (S5).


As described above, in the present example embodiment, in the online optimization in a case where values of the objective function are only partially acquired for the non-linear objective function, the update rate by the multiplicative weight update method is determined from the determined policy, and the non-linear objective function is updated by the update rate. Therefore, high-precision optimization can be achieved.


The optimization apparatus 100 includes a processor, a memory, and a storage device as a configuration not illustrated. The storage device stores a computer program in which processing of the optimization method according to the present example embodiment is implemented. The processor then causes a computer program to be read from the storage device into the memory and executes the computer program. As a result, the processor achieves functions of the setting unit 110, the policy determination unit 120, the policy execution unit 130, the update rate determination unit 140, and the update unit 150.


Alternatively, each of the setting unit 110, the policy determination unit 120, the policy execution unit 130, the update rate determination unit 140, and the update unit 150 may be achieved by dedicated hardware. In addition, a part or all of each component of each apparatus may be achieved by a general-purpose or special-purpose circuitry, a processor, or the like, or a combination thereof. These may be configured by a single chip, or may be configured by a plurality of chips connected via a bus. A part or all of each component of each apparatus may be achieved by a combination of the above-described circuitry or the like with a program. As the processor, a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), or the like can be used.


When a part or all of the components of the optimization apparatus 100 are achieved by a plurality of information processing apparatuses, circuitries, and the like, the plurality of information processing apparatuses, circuitries, and the like may be concentratedly arranged or dispersedly arranged. For example, the information processing apparatus, the circuitry, and the like may be achieved as a client-server system, a cloud computing system, and the like, each of which is connected via a communication network. Functions of the optimization apparatus 100 may be provided in a software as a service (SaaS) format.


Second Example Embodiment

Bandit convex optimization (BCO) is an online decision-making framework with limited (partial) feedback. In this framework, a player is given a convex feasible region $K \subseteq \mathbb{R}^d$ and the number of repetitions $T$ of decision making. Herein, $d$ is a positive integer indicating the number of dimensions of the feasible region. In each round $t = 1, 2, \ldots, T$, the player selects an action (policy) $a_t \in K$ and, at the same time, the environment selects a convex function $f_t$. The player observes the feedback $f_t(a_t)$ prior to selecting the next policy $a_{t+1}$. Herein, it is assumed that $K$ has a positive volume, i.e.,

$$\int_{x \in K} 1 \, dx > 0.$$  [Math. 1]


It is assumed that $f_t$ is $\sigma$-strongly convex and $\beta$-smooth. In other words, the following expressions (1) and (2) are satisfied for all $x, y \in K$:

$$f_t(y) \ge f_t(x) + \nabla f_t(x)^{\top}(y - x) + \frac{\sigma}{2}\|y - x\|_2^2,$$  [Math. 2] . . . (1)

$$f_t(y) \le f_t(x) + \nabla f_t(x)^{\top}(y - x) + \frac{\beta}{2}\|y - x\|_2^2.$$  [Math. 3] . . . (2)
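As an illustrative check of expressions (1) and (2), the following minimal Python sketch tests both bounds numerically for a diagonal quadratic function. The function `f`, its coefficients `c`, and the helper `satisfies_bounds` are hypothetical and are not part of the disclosure; for this particular choice, $\sigma$ and $\beta$ can be taken as the smallest and the largest coefficient, respectively.

```python
def f(x, c):
    """Hypothetical loss f(x) = 0.5 * sum_i c_i * x_i^2."""
    return 0.5 * sum(ci * xi * xi for ci, xi in zip(c, x))


def grad_f(x, c):
    """Gradient of the hypothetical loss above."""
    return [ci * xi for ci, xi in zip(c, x)]


def satisfies_bounds(x, y, c, sigma, beta, tol=1e-9):
    """Check expressions (1) and (2) for a single pair of points (x, y)."""
    diff = [yi - xi for xi, yi in zip(x, y)]
    linear = f(x, c) + sum(g * di for g, di in zip(grad_f(x, c), diff))
    sq_dist = sum(di * di for di in diff)
    lower = linear + 0.5 * sigma * sq_dist  # right-hand side of (1)
    upper = linear + 0.5 * beta * sq_dist   # right-hand side of (2)
    return lower - tol <= f(y, c) <= upper + tol
```

With `c = [1.0, 3.0]`, the check is expected to pass for `sigma=1.0` and `beta=3.0`, and to fail when `sigma` is chosen larger than the largest coefficient.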







Performance of the player is evaluated by the regret $R_T(x^*)$, which is an evaluation index. Herein, the regret $R_T(x^*)$ is defined by the following equation (3) with respect to $x^* \in K$:

$$R_T(x^*) = \sum_{t=1}^{T} f_t(a_t) - \sum_{t=1}^{T} f_t(x^*).$$  [Math. 4] . . . (3)
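The regret of equation (3) can be computed directly once the loss functions and the executed policies are known. The following sketch assumes scalar policies and hypothetical quadratic losses chosen only for illustration.

```python
def regret(losses, actions, x_star):
    """Regret R_T(x*) of equation (3): cumulative loss of the actions minus that of x*."""
    incurred = sum(f(a) for f, a in zip(losses, actions))
    benchmark = sum(f(x_star) for f in losses)
    return incurred - benchmark


# Hypothetical losses f_t(x) = (x - c_t)^2 with c_t = 0, 1, 2.
losses = [lambda x, c=c: (x - c) ** 2 for c in (0.0, 1.0, 2.0)]
actions = [0.0, 0.0, 0.0]
print(regret(losses, actions, x_star=1.0))  # prints 3.0
```

A negative value is possible when the executed policies track the changing losses better than the fixed comparator does.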







In the present disclosure, a convex benchmark set $K' \subseteq K$ is employed for evaluating the performance of the player, and the regret $R_T(x^*)$ for $x^* \in K'$ is considered. In other words, attention is paid to the value of

$$\sup_{x^* \in K'} E[R_T(x^*)],$$  [Math. 5]

which is the expected gap between the cumulative loss of the outputs of the algorithm and that of the optimal independent (fixed) policy $x^*$ belonging to $K'$. Note that $E[\,\cdot\,]$ denotes an expected value. When the optimal independent policy $x^*$ that satisfies

$$x^* \in \operatorname*{arg\,min}_{x \in K} \sum_{t=1}^{T} f_t(x)$$  [Math. 6]

belongs to $K'$, the value of

$$\sup_{x^* \in K'} E[R_T(x^*)]$$  [Math. 7]

is equal to the standard worst-case regret,

$$\sup_{x^* \in K} E[R_T(x^*)].$$  [Math. 8]


A point $x \in \mathbb{R}^d$ is called a $\gamma$-interior of $K$ when $\|y - x\|_2 \le \gamma$ implies $y \in K$. For example, when $K$ is expressed by $m$ linear inequalities, i.e., $K$ is expressed as

$$K = \{x \in \mathbb{R}^d \mid a_j^{\top} x \le b_j \ (j \in [m])\},$$  [Math. 9]

the convex set $K'$ defined by

$$K' = \{x \in \mathbb{R}^d \mid a_j^{\top} x \le b_j - r \ (j \in [m])\}$$  [Math. 10]

consists of $r$-interiors of $K$.
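The relation between $K$ and the shrunken set $K'$ can be illustrated with a small two-dimensional sketch. It assumes, as an assumption made here and not stated in the text, that every row $a_j$ has unit $\ell_2$ norm, so that reducing $b_j$ by $r$ moves each face inward by a distance of exactly $r$. The function names and the unit-square example are hypothetical.

```python
import math


def in_polytope(x, A, b, margin=0.0):
    """True when a_j^T x <= b_j - margin holds for every row a_j of A."""
    return all(
        sum(aij * xi for aij, xi in zip(a, x)) <= bj - margin
        for a, bj in zip(A, b)
    )


def is_r_interior_sampled(x, A, b, r, n_dirs=16):
    """Check on sampled directions that points at distance r from x lie in K (2-D only)."""
    for k in range(n_dirs):
        angle = 2.0 * math.pi * k / n_dirs
        y = [x[0] + r * math.cos(angle), x[1] + r * math.sin(angle)]
        # A tiny negative margin tolerates floating-point rounding on the boundary.
        if not in_polytope(y, A, b, margin=-1e-12):
            return False
    return True


# Hypothetical K: the unit square [0, 1]^2 written with unit-norm rows.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 0.0, 1.0, 0.0]
```

Calling `in_polytope(x, A, b, margin=r)` tests membership in $K'$, while `in_polytope(x, A, b)` tests membership in $K$.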


For a generic benchmark set $K' \subseteq K$, let $r \ge 0$ be a non-negative real value such that all members of $K'$ are $r$-interiors of $K$. In the special case $K' = K$, $r$ is equal to zero. It is further assumed that there is a positive number $R > 0$ such that the $\ell_2$ norm of any element of $K'$ is at most $R$.


It is also assumed that a membership oracle for $K'$ can be accessed. This means that, given $x \in \mathbb{R}^d$, whether $x \in K'$ can be determined by calling the membership oracle. For example, when $K'$ is expressed by $m$ inequalities as

$$K' = \{x \in \mathbb{R}^d \mid g_j(x) \le 0 \ (j \in [m])\},$$  [Math. 11]

a membership oracle for $K'$ is available. This is because whether $x \in K'$ can be checked by evaluating $g_j(x)$ for $j \in [m]$. Furthermore, when a linear optimization problem on $K'$ can be solved in polynomial time, it is known from Non Patent Literature 2 that a polynomial-time membership oracle for $K'$ exists.
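A membership oracle of the kind described above can be sketched as a function that evaluates every constraint $g_j$. The constraint functions below (a unit disc intersected with a half-plane) are hypothetical examples and are not taken from the disclosure.

```python
def make_membership_oracle(constraints):
    """Return an oracle that reports whether g_j(x) <= 0 holds for every constraint g_j."""
    def oracle(x):
        return all(g(x) <= 0.0 for g in constraints)
    return oracle


# Hypothetical K': the unit disc intersected with the half-plane x_0 >= 0.
constraints = [
    lambda x: x[0] ** 2 + x[1] ** 2 - 1.0,
    lambda x: -x[0],
]
oracle = make_membership_oracle(constraints)
```

Each oracle call costs one evaluation of each of the $m$ constraint functions.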


<Notation>


For a vector $x = (x_1, \ldots, x_d)^{\top} \in \mathbb{R}^d$, the $\ell_2$ norm of $x$ is denoted by $\|x\|_2$. In other words,

$$\|x\|_2 = \sqrt{x^{\top} x} = \sqrt{\sum_{i=1}^{d} x_i^2}.$$  [Math. 12]

For a matrix $X \in \mathbb{R}^{d \times d}$, the $\ell_2$-operator norm is denoted by $\|X\|_2$. In other words, $\|X\|_2 = \max\{\|Xy\|_2 \mid y \in \mathbb{R}^d, \|y\|_2 = 1\}$. When $X$ is a symmetric matrix, $\|X\|_2$ is equal to the largest absolute value of an eigenvalue of $X$. Given a positive semidefinite matrix $A \in \mathbb{R}^{d \times d}$ and a vector $x \in \mathbb{R}^d$, $\|x\|_A$ is defined as follows:

$$\|x\|_A = \sqrt{x^{\top} A x} = \|A^{1/2} x\|_2.$$  [Math. 13]

Similarly, for a matrix $X \in \mathbb{R}^{d \times d}$, $\|X\|_A$ is defined as follows:

$$\|X\|_A = \|A^{1/2} X A^{1/2}\|_2.$$  [Math. 14]
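The vector norms defined above can be evaluated with a few lines of Python. This is only a sketch with hypothetical helper names; the matrix `A` is assumed to be positive semidefinite and is given as a list of rows.

```python
import math


def norm_2(x):
    """l2 norm of a vector (Math. 12)."""
    return math.sqrt(sum(xi * xi for xi in x))


def norm_A(x, A):
    """Weighted norm ||x||_A = sqrt(x^T A x) for a positive semidefinite A (Math. 13)."""
    Ax = [sum(aij * xj for aij, xj in zip(row, x)) for row in A]
    return math.sqrt(sum(xi * yi for xi, yi in zip(x, Ax)))
```

For example, with `A = [[4.0, 0.0], [0.0, 1.0]]` and `x = [1.0, 2.0]`, `norm_A(x, A)` agrees with `norm_2([2.0, 2.0])`, because $A^{1/2} x = (2, 2)^{\top}$ in that case.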

<Smoothed Convex Function>


Let $v$ and $u$ be random variables that follow the uniform distributions on $B^d = \{v \in \mathbb{R}^d \mid \|v\|_2 \le 1\}$ and $S^d = \{u \in \mathbb{R}^d \mid \|u\|_2 = 1\}$, respectively. For a convex function $f$ on $\mathbb{R}^d$ and an invertible (regular) matrix $B \in \mathbb{R}^{d \times d}$, a smoothed function $\hat{f}_B$ is defined by the following equation (4):

$$\hat{f}_B(x) = E[f(x + Bv)].$$  [Math. 15] . . . (4)
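The smoothed function of equation (4) can be approximated by Monte Carlo sampling. The following sketch is an illustration under stated assumptions: the helper names are hypothetical, and the uniform sample on the unit ball is generated by the standard technique of scaling a normalized Gaussian vector by a radius $U^{1/d}$ with $U$ uniform on $[0, 1)$.

```python
import math
import random


def sample_unit_ball(d, rng):
    """Uniform sample from B^d: random direction scaled by radius U^(1/d)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    radius = rng.random() ** (1.0 / d)
    return [radius * gi / norm for gi in g]


def smoothed_value(f, x, B, n_samples, rng):
    """Monte Carlo estimate of f_hat_B(x) = E[f(x + B v)] from equation (4)."""
    total = 0.0
    for _ in range(n_samples):
        v = sample_unit_ball(len(x), rng)
        Bv = [sum(bij * vj for bij, vj in zip(row, v)) for row in B]
        total += f([xi + bi for xi, bi in zip(x, Bv)])
    return total / n_samples
```

For the hypothetical choice $f(x) = \|x\|_2^2$ and $B = \delta I$, the exact value is $\|x\|_2^2 + \delta^2 d/(d+2)$, which gives a convenient reference for checking the estimate.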

<Auxiliary Theorem 1 (Non Patent Literature 3)>


The gradient of $\hat{f}_B$ is expressed by the following equation (5):

$$\nabla \hat{f}_B(x) = E[d \cdot f(x + Bu) B^{-1} u].$$  [Math. 16] . . . (5)

When $f$ is $\beta$-smooth, the following expression (6) holds:

$$0 \le \hat{f}_B(x) - f(x) \le \frac{\beta}{2}\|B^{\top} B\|_2 = \frac{\beta}{2}\lambda_1(B^{\top} B).$$  [Math. 17] . . . (6)

When $f$ is $\sigma$-strongly convex, $\hat{f}_B$ is also $\sigma$-strongly convex.


The equation (5) follows from Stokes' theorem, and the expression (6) is derived from the definition of $\beta$-smoothness. In the bandit feedback setting, an unbiased estimate of the gradient of $f_t$ is not available, but an unbiased estimate of the gradient of the smoothed function $\hat{f}_t$ can be constructed based on the equation (5). The difference between $f_t$ and $\hat{f}_t$ is bounded by the expression (6).
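The one-point gradient estimate of equation (5) can be illustrated numerically. The sketch below restricts itself, for brevity, to the special case $B = \delta I$ (an assumption made here, so that $B^{-1} u = u/\delta$), and averages many single-sample estimates. All function names are hypothetical.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Uniform sample from S^d = {u : ||u||_2 = 1} via a normalized Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / norm for gi in g]


def one_point_gradient(f, x, delta, rng):
    """Single-sample estimate d * f(x + B u) * B^{-1} u for B = delta * I."""
    d = len(x)
    u = sample_unit_sphere(d, rng)
    value = f([xi + delta * ui for xi, ui in zip(x, u)])
    return [d * value * ui / delta for ui in u]


def averaged_gradient(f, x, delta, n_samples, rng):
    """Average of independent one-point estimates; approaches the smoothed gradient."""
    acc = [0.0] * len(x)
    for _ in range(n_samples):
        g = one_point_gradient(f, x, delta, rng)
        acc = [a + gi for a, gi in zip(acc, g)]
    return [a / n_samples for a in acc]
```

For $f(x) = \|x\|_2^2$ the smoothed gradient equals $2x$, so the average should approach $2x$ as the number of samples grows. A single estimate is very noisy, which is why only one function evaluation per round suffices in expectation but not in any individual round.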


<Log-Concave Distribution>


A probability distribution on a convex set $K \subseteq \mathbb{R}^d$ is called a log-concave distribution when its probability density function $p: K \to \mathbb{R}$ is expressed as $p(x) = \exp(-g(x))$ by using a convex function $g: K \to \mathbb{R}$, i.e., when the logarithm of $p(x)$ is a concave function. The algorithm of the present disclosure maintains a log-concave distribution. Random samples from log-concave distributions can be generated efficiently under mild assumptions. In fact, as described in Non Patent Literature 4, there is a computationally efficient MCMC algorithm for sampling from $p$ when a membership oracle for $K$ and an evaluation oracle for $g$ are given. Thus, in the present disclosure, estimates of the mean $\mu(p)$ and the covariance matrix $\mathrm{Cov}(p)$ of $p$ can be computed efficiently. The following auxiliary theorems are useful for bounding the variance of a log-concave distribution.


<Auxiliary Theorem 2 (Prop. 10.1 of Non Patent Literature 5)>


It is assumed that a log-concave distribution on $K$ has a probability density function $p(x) = \exp(-g(x))$, where $g$ is a $\sigma$-strongly convex function. Then, the covariance matrix $\Sigma$ of $p$ satisfies $\|\Sigma\|_2 \le 1/\sigma$.
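The bound of Auxiliary Theorem 2 can be checked numerically in one dimension. The sketch below assumes the hypothetical choice $g(x) = \sigma x^2 / 2$ (up to an additive constant) on an interval $[lo, hi]$ and integrates with the midpoint rule; it is an illustration, not part of the disclosure.

```python
import math


def truncated_variance(sigma, lo, hi, n=20000):
    """Variance of the density proportional to exp(-sigma * x^2 / 2) on [lo, hi]."""
    h = (hi - lo) / n
    xs = [lo + (i + 0.5) * h for i in range(n)]          # midpoints of the grid cells
    w = [math.exp(-0.5 * sigma * x * x) for x in xs]     # unnormalized density values
    z = sum(w)
    mean = sum(wi * x for wi, x in zip(w, xs)) / z
    return sum(wi * (x - mean) ** 2 for wi, x in zip(w, xs)) / z
```

For any interval, the returned variance should not exceed $1/\sigma$; truncating the distribution to a narrower interval only makes the variance smaller.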


In order to ensure that an output at of the algorithm of the present disclosure is included in K, the following auxiliary theorem is used.


<Auxiliary Theorem 3 (Non Patent Literature 4)>


Let $p$ be a log-concave distribution on $K$. Then, the following ellipsoid is included in $K$:

$$\{x \in \mathbb{R}^d \mid \|x - \mu(p)\|_{\mathrm{Cov}(p)^{-1}} \le 1/e\}.$$  [Math. 18]

<Configuration of Optimization Apparatus>



FIG. 3 is a block diagram illustrating a configuration of an optimization apparatus 200 according to the second example embodiment. The optimization apparatus 200 is an information processing apparatus which is a specific example of the optimization apparatus 100 described above. The optimization apparatus 200 includes a storage unit 210, a memory 220, an interface (IF) unit 230, and a control unit 240.


The storage unit 210 is a storage device such as a hard disk or a flash memory. The storage unit 210 stores an exploration parameter 211 and an optimization program 212. The exploration parameter 211 is a parameter used when determining a policy for each round, which will be described later. As the exploration parameter 211, a parameter group calculated for each round may be input from outside and stored. Alternatively, the exploration parameter 211 may be calculated for each round in the optimization apparatus 200. The optimization program 212 is a computer program in which the optimization method according to the present example embodiment is implemented.


The memory 220 is a volatile storage device such as a random access memory (RAM), and is a storage area for temporarily holding information during an operation of the control unit 240. The IF unit 230 is an interface for inputting and outputting data to and from the outside of the optimization apparatus 200. For example, the IF unit 230 receives input data from another computer or the like via a network (not illustrated), and outputs the received input data to the control unit 240. The IF unit 230 outputs data to a destination computer via a network in response to an instruction from the control unit 240. Alternatively, the IF unit 230 receives a user operation via an input device (not illustrated) such as a keyboard, a mouse, or a touch panel, and outputs the received operation content to the control unit 240. In addition, the IF unit 230 performs the output to a touch panel, a display device, a printer, or the like (not illustrated), in response to an instruction from the control unit 240.


The control unit 240 is a processor such as a central processing unit (CPU), and controls each component of the optimization apparatus 200. The control unit 240 reads the optimization program 212 from the storage unit 210 into the memory 220, and executes the optimization program 212. As a result, the control unit 240 achieves functions of a setting unit 241, a policy determination unit 242, a policy execution unit 243, an update rate determination unit 244, and an update unit 245. The setting unit 241, the policy determination unit 242, the policy execution unit 243, the update rate determination unit 244, and the update unit 245 are examples of the setting unit 110, the policy determination unit 120, the policy execution unit 130, the update rate determination unit 140, and the update unit 150 described above, respectively.


The setting unit 241 performs initial setting of a predetermined non-linear objective function. In addition, the setting unit 241 receives input of the exploration parameter 211 or a calculation formula of the exploration parameter 211 from the outside as necessary, and stores the received data in the storage unit 210.


The policy determination unit 242 determines a policy to be executed in the online optimization in the bandit problem, based on the non-linear objective function. Herein, the policy determination unit 242 determines the policy by further using the exploration parameter 211. Furthermore, the policy determination unit 242 may calculate (update, select) the exploration parameter 211, based on the number of times the non-linear objective function has been updated, and may determine the policy by further using the calculated exploration parameter 211. Further, the policy determination unit 242 may calculate the exploration parameter 211, based on a distance from a boundary of a feasible region. For example, the policy determination unit 242 calculates the exploration parameter 211 for each round by using the calculation formula input as described above. Alternatively, the policy determination unit 242 may use the exploration parameter 211 calculated outside in advance in accordance with the number of rounds. In addition, the policy determination unit 242 calculates a mean value and a covariance matrix of a plurality of samples generated based on the estimated value of the non-linear objective function, and determines a policy by further using the mean value and the covariance matrix.


The policy execution unit 243 executes the policy determined by the policy determination unit 242 as an input of a non-linear objective function, and acquires a reward as an execution result (value of the objective function).


The update rate determination unit 244 determines an update rate of the non-linear objective function by the multiplicative weight update method, based on the reward acquired by the policy execution unit 243 and the non-linear objective function.


The update unit 245 updates the non-linear objective function, based on the update rate determined by the update rate determination unit 244.


<Flow of Optimization Method>



FIG. 4 is a flowchart illustrating a flow of the optimization method according to the second example embodiment. First, as a precondition, an upper limit value $T \in \mathbb{N}$ of rounds (the number of repetitions), a learning rate $\eta > 0$, a membership oracle MO for $K'$, and a strong-convexity parameter $\sigma > 0$ are assumed to be given. It is also assumed that the exploration parameters $\alpha_t$ satisfy the following:

$$\{\alpha_t\}_{t=1}^{T} \subseteq \mathbb{R}_{>0}.$$  [Math. 19]


In addition, the optimization apparatus 200 holds, in the memory 220, a function $z_t$ on $K'$ based on the multiplicative weights update method according to Non Patent Literature 6.


First, the setting unit 241 initializes $z_t$, which is (an estimated value of) the non-linear objective function, as follows (S201):

$$z_1(x) = \frac{\sigma}{2}\|x\|_2^2.$$  [Math. 20]

In each round, $p_t$ denotes the probability distribution on $K'$ whose density is proportional to $\exp(-\eta z_t(x))$. In other words, $Z_t$ and $p_t$ are defined by the following expression (7):

$$Z_t := \int_{x \in K'} \exp(-\eta z_t(x)) \, dx, \quad p_t(x) = \frac{\exp(-\eta z_t(x))}{Z_t}.$$  [Math. 21] . . . (7)







Next, the control unit 240 increments t by one from round t=1 to T, and repeats the following steps S203 to S210 (S202).


First, the policy determination unit 242 generates samples $x_t^{(1)}, \ldots, x_t^{(M)}$ from $p_t$. Herein, $M$ is the number of samples to be generated and $M \ge 1$. Note that a method of generating samples from $p_t$ will be described later. Next, the policy determination unit 242 calculates an estimated value $\hat{\mu}_t$ of the mean and an estimated value $\hat{\Sigma}_t$ of the covariance matrix from the generated samples $x_t^{(1)}, \ldots, x_t^{(M)}$ by the following equation (9), in such a way as to satisfy the following expression (8) (S203). Note that $\mu_t$ and $\Sigma_t$ represent the mean and the covariance matrix of $p_t$.

$$\|\hat{\mu}_t - \mu_t\|_{\Sigma_t^{-1}} \le 1/9, \quad \|\hat{\Sigma}_t - \Sigma_t\|_{\Sigma_t^{-1}} \le 1/9, \quad E[\hat{\mu}_t \mid \mu_t] = \mu_t.$$  [Math. 22] . . . (8)

$$\hat{\mu}_t = \frac{1}{M}\sum_{j=1}^{M} x_t^{(j)}, \quad \hat{\Sigma}_t = \frac{1}{M}\sum_{j=1}^{M} (x_t^{(j)} - \hat{\mu}_t)(x_t^{(j)} - \hat{\mu}_t)^{\top}.$$  [Math. 23] . . . (9)

Herein, when $M$ is set to be sufficiently large, the expression (8) holds with a high probability.
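The sample mean and the sample covariance matrix of the step S203 can be sketched as follows. The function name is hypothetical, samples are given as lists of coordinates, and the covariance is normalized by $M$, which is the reading of the sample covariance formula assumed here.

```python
def sample_mean_and_covariance(samples):
    """Return (mu_hat, Sigma_hat) computed from a list of d-dimensional samples."""
    m = len(samples)
    d = len(samples[0])
    mu = [sum(x[i] for x in samples) / m for i in range(d)]
    cov = [
        [sum((x[i] - mu[i]) * (x[k] - mu[k]) for x in samples) / m for k in range(d)]
        for i in range(d)
    ]
    return mu, cov
```

For the four corner points `[[0, 0], [2, 0], [0, 2], [2, 2]]`, this returns the mean `[1.0, 1.0]` and the identity matrix as the covariance.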


Next, the policy determination unit 242 calculates a matrix $B_t \in \mathbb{R}^{d \times d}$ in such a way as to satisfy the following:

$$B_t^{\top} B_t = \hat{\Sigma}_t.$$  [Math. 24]

For example, the policy determination unit 242 can calculate the matrix $B_t$ by a Cholesky decomposition algorithm.
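One way to obtain such a matrix is sketched below. It computes the lower-triangular Cholesky factor $L$ with $\hat{\Sigma}_t = L L^{\top}$ and returns its transpose, so that the returned matrix satisfies the required relation. The function name is hypothetical, and the input is assumed to be symmetric and positive definite.

```python
import math


def cholesky_upper(S):
    """Return an upper-triangular B with B^T B = S for a symmetric positive definite S."""
    d = len(S)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # S = L L^T, hence B = L^T satisfies B^T B = S.
    return [[L[j][i] for j in range(d)] for i in range(d)]
```

If the estimated covariance matrix is only positive semidefinite, for example because too few samples were drawn, this plain factorization fails; handling that case is outside the scope of the sketch.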


The policy determination unit 242 selects the exploration parameter $\alpha_t$ (S205). A method of selecting the exploration parameter $\alpha_t$ will be described later.


Then, the policy determination unit 242 smooths the function $f_t$ by using $B = \alpha_t B_t$ in the above equation (4), and the smoothed function $\hat{f}_t$ is given as an expected value by the following expression (10):

$$\hat{f}_t(x) := E[f_t(x + \alpha_t B_t v)].$$  [Math. 25] . . . (10)

Herein, $v$ is uniformly distributed on $B^d = \{v \in \mathbb{R}^d \mid \|v\|_2 \le 1\}$ as described above.


Then, the policy determination unit 242 selects $u_t$ uniformly at random from the unit sphere $S^d = \{u \in \mathbb{R}^d \mid \|u\|_2 = 1\}$ (S206). Then, the policy determination unit 242 calculates (determines) a policy $a_t$ by $a_t = \hat{\mu}_t + \alpha_t B_t u_t$, using the estimated mean $\hat{\mu}_t$, the exploration parameter $\alpha_t$, the matrix $B_t$, and the selected $u_t$ (S207).
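The steps S206 and S207 can be sketched as follows. The helper names are hypothetical; the uniform sample on the unit sphere is obtained by normalizing a standard Gaussian vector, which is a common technique and not one prescribed by the text.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Uniform sample from S^d = {u : ||u||_2 = 1} via a normalized Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / norm for gi in g]


def determine_policy(mu_hat, B, alpha, rng):
    """Return (a_t, u_t) with a_t = mu_hat + alpha * B u_t (steps S206 and S207)."""
    u = sample_unit_sphere(len(mu_hat), rng)
    Bu = [sum(bij * uj for bij, uj in zip(row, u)) for row in B]
    a = [m + alpha * b for m, b in zip(mu_hat, Bu)]
    return a, u
```

The direction `u` is returned together with the policy because the update rate calculation in the later step needs the same `u`.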


Then, the policy execution unit 243 executes the determined policy $a_t$ and acquires the execution result. Specifically, the policy execution unit 243 inputs the policy $a_t$ to the function $f_t(x)$ and observes the output value $f_t(a_t)$ of the function (S208).


Based on this observation, the update rate determination unit 244 calculates an update rate $\hat{g}_t \in \mathbb{R}^d$ by the following equation (11) (S209):

$$\hat{g}_t = d \cdot f_t(a_t) (\alpha_t B_t)^{-1} u_t.$$  [Math. 26] . . . (11)

This is a random estimate of the gradient $\nabla \hat{f}_t(\hat{\mu}_t)$. In other words, given $\hat{\mu}_t$ and $B_t$, the conditional expected value of $\hat{g}_t$ satisfies the following equation (12):

$$E[\hat{g}_t] = E[d \cdot f_t(\hat{\mu}_t + \alpha_t B_t u_t)(\alpha_t B_t)^{-1} u_t] = \nabla \hat{f}_t(\hat{\mu}_t),$$  [Math. 27] . . . (12)

where the second equality is acquired from the above equation (5).
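The update rate of equation (11) can be computed without forming an explicit inverse matrix. The sketch below assumes that $B_t$ is upper triangular with a non-zero diagonal, as would be the case for a transposed Cholesky factor; this is an assumption made here for simplicity, and the function names are hypothetical.

```python
def solve_upper_triangular(B, u):
    """Solve B y = u by back substitution for an upper-triangular B."""
    d = len(u)
    y = [0.0] * d
    for i in reversed(range(d)):
        s = u[i] - sum(B[i][j] * y[j] for j in range(i + 1, d))
        y[i] = s / B[i][i]
    return y


def update_rate(reward, B, alpha, u):
    """g_hat = d * f_t(a_t) * (alpha * B)^{-1} u, as in equation (11)."""
    d = len(u)
    y = solve_upper_triangular(B, u)  # y = B^{-1} u
    return [d * reward * yi / alpha for yi in y]
```

Here `reward` stands for the observed value $f_t(a_t)$, and `u` must be the same direction that was used to build the executed policy.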


The update unit 245 updates $z_t$ by using the random estimated value $\hat{g}_t$ as illustrated in the following equation (13) (S210):

$$z_{t+1}(x) = z_t(x) + \hat{g}_t^{\top}(x - \hat{\mu}_t) + \frac{\sigma}{2}\|x - \hat{\mu}_t\|_2^2.$$  [Math. 28] . . . (13)







Thereafter, the control unit 240 determines whether the round t=T (S211), and when t is less than T, the control unit 240 returns to the step S202, performs t=t+1, and executes the step S203 and subsequent steps. In the step S211, when t is T, the present optimization processing ends.


<Example of Method of Generating Samples from $p_t$>


A simple example of a method of generating samples from $p_t$ is to use a normal distribution. Herein, since $p_t$ is defined by the above $z_1(x)$, expression (7), and equation (13), the distribution $p_t$ is a truncated multivariate normal distribution on $K'$ expressed as follows:

$$p_t(x) \propto \exp\left(-\frac{\sigma \eta t}{2}\|x - \theta_t\|_2^2\right) \quad (x \in K'),$$  [Math. 29]

$$p_t(x) = 0 \quad (x \in \mathbb{R}^d \setminus K'),$$

where

$$\theta_t = \frac{1}{t}\sum_{j=1}^{t-1}\left(\hat{\mu}_j - \frac{1}{\sigma}\hat{g}_j\right).$$  [Math. 30]
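The closed form above can be verified by expanding the recursion of equation (13) from the initial function $z_1$. The following derivation is a sketch in which $c_t$ and $c'_t$ denote terms that do not depend on $x$; these two symbols are introduced here only for the derivation.

```latex
\begin{align*}
z_t(x) &= \frac{\sigma}{2}\|x\|_2^2
  + \sum_{j=1}^{t-1}\left[\hat{g}_j^{\top}(x-\hat{\mu}_j)
  + \frac{\sigma}{2}\|x-\hat{\mu}_j\|_2^2\right] \\
&= \frac{t\sigma}{2}\|x\|_2^2
  - \sigma\, x^{\top}\sum_{j=1}^{t-1}\left(\hat{\mu}_j - \frac{1}{\sigma}\hat{g}_j\right) + c_t \\
&= \frac{t\sigma}{2}\|x-\theta_t\|_2^2 + c'_t,
  \qquad \theta_t = \frac{1}{t}\sum_{j=1}^{t-1}\left(\hat{\mu}_j - \frac{1}{\sigma}\hat{g}_j\right).
\end{align*}
```

Multiplying by $-\eta$ and exponentiating gives a density proportional to $\exp(-\frac{\sigma\eta t}{2}\|x-\theta_t\|_2^2)$, i.e., a normal distribution with mean $\theta_t$ and covariance $\frac{1}{\sigma\eta t} I$ restricted to the benchmark set.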








Therefore, by repeatedly sampling $x$ from the normal distribution

$$N\left(\theta_t, \frac{1}{\sigma \eta t} I\right)$$  [Math. 31]

until $x \in K'$ is satisfied, a sample $x$ following $p_t$ can be acquired.
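This sampling procedure amounts to rejection sampling and can be sketched as follows. The function name, the `max_tries` safeguard, and the box-shaped benchmark set mentioned in the usage note are hypothetical; the membership oracle is passed in as a plain Python function.

```python
import math
import random


def sample_p_t(theta, sigma, eta, t, oracle, rng, max_tries=100000):
    """Draw x from N(theta, I / (sigma * eta * t)) until the oracle accepts it."""
    std = 1.0 / math.sqrt(sigma * eta * t)
    for _ in range(max_tries):
        x = [rng.gauss(m, std) for m in theta]
        if oracle(x):
            return x
    # Acceptance can be very unlikely when theta lies far outside the set.
    raise RuntimeError("no sample accepted; an MCMC-based sampler may be needed")
```

For instance, with a box $[-1, 1]^2$ as the benchmark set, the oracle can be `lambda x: all(-1.0 <= xi <= 1.0 for xi in x)`. The number of rejected draws grows as `theta` moves away from the set, which is why the loop is capped.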


However, although the above processing is sufficiently practical in many cases, it does not necessarily terminate in polynomial time. In such a case, since $p_t$ is a log-concave distribution, a polynomial-time sampling method based on MCMC (Non Patent Literature 4) can be applied instead of the above processing. At this time, the membership oracle may be called. Note that, as a more efficient method of computing $\hat{\mu}$ and $\hat{\Sigma}$ and sampling from $p_t$, the technique of Non Patent Literature 7 can be used.


<Example of Method of Selecting Exploration Parameter αt>


In the step S205 described above, the policy determination unit 242 needs to select $\alpha_t$ in such a way that $a_t = \hat{\mu}_t + \alpha_t B_t u_t$ is a feasible solution, i.e., $a_t \in K$. The following proposition provides a sufficient condition for this.


<Proposition>


When $\alpha_t$ is bounded by

$$0 < \alpha_t \le \frac{1}{9} + r\sqrt{\frac{t\eta\sigma}{2}},$$  [Math. 32]

$a_t = \hat{\mu}_t + \alpha_t B_t u_t$ is within $K$.


<Proof>


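Let $\alpha_{t1}$ and $\alpha_{t2}$ be positive numbers satisfying

$$\alpha_{t1} \le \frac{1}{9}, \qquad \alpha_{t2} \le r\sqrt{\frac{t\eta\sigma}{2}}$$  [Math. 33]

and $\alpha_t = \alpha_{t1} + \alpha_{t2}$. Then $a_t$ is expressed as $a_t = \hat{\mu}_t + \alpha_{t1} B_t u_t + \alpha_{t2} B_t u_t$, and since all points of $K'$ are $r$-interior points of $K$, it suffices to show (i) $\hat{\mu}_t + \alpha_{t1} B_t u_t \in K'$ and (ii) $\|\alpha_{t2} B_t u_t\|_2 \le r$.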


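From Auxiliary Theorem 3,

$$\|\hat{\mu}_t + \alpha_{t1} B_t u_t - \mu_t\|_{\Sigma_t^{-1}} \le 1/e$$  [Math. 34]

implies $\hat{\mu}_t + \alpha_{t1} B_t u_t \in K'$. From the triangle inequality, the following expression (14) is obtained:

$$\begin{aligned}
\|\hat{\mu}_t + \alpha_{t1} B_t u_t - \mu_t\|_{\Sigma_t^{-1}}
&\le \|\hat{\mu}_t - \mu_t\|_{\Sigma_t^{-1}} + \alpha_{t1}\|B_t u_t\|_{\Sigma_t^{-1}} \\
&\le \frac{1}{9} + \frac{1}{9}\|\Sigma_t^{-1/2} B_t u_t\|_2 \\
&\le \frac{1}{9}\left(1 + \|\Sigma_t^{-1/2} B_t\|_2\right). \qquad (14)
\end{aligned}$$  [Math. 35]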








From the expression (8),

$$\|\Sigma_t^{-1/2} B_t\|_2 \le 2$$  [Math. 36]

is derived. By combining this with the expression (14), the following inequality, which implies (i), is obtained:

$$\|\hat{\mu}_t + \alpha_{t1} B_t u_t - \mu_t\|_{\Sigma_t^{-1}} \le \frac{1}{3} \le \frac{1}{e}.$$  [Math. 37]

Herein, since $\eta z_t(x)$ is a $(t\eta\sigma)$-strongly convex function, the covariance matrix $\Sigma_t = \mathrm{Cov}(p_t)$ is bounded as

$$\|\Sigma_t\|_2 \le \frac{1}{t\eta\sigma}$$  [Math. 38]

from the Auxiliary Theorem 2. From this and the expression (8),

$$\|\hat{\Sigma}_t\|_2 \le \frac{2}{t\eta\sigma}$$  [Math. 39]

is obtained. Therefore, the following expression (15) is obtained:

$$\|B_t\|_2 \le \sqrt{\|\hat{\Sigma}_t\|_2} \le \sqrt{\frac{2}{t\eta\sigma}}. \qquad (15)$$  [Math. 40]

From the expression (15),

$$\|u_t\|_2 = 1, \quad \text{and} \quad \alpha_{t2} \le r\sqrt{\frac{t\eta\sigma}{2}},$$  [Math. 41]

(ii) is shown.
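For reference, the two parts of the proof can be collected into explicit chains of inequalities. This is only a restatement of the steps above, using the bound 2 on the operator norm from expression (8), the bound from expression (15), and the fact that $u_t$ has unit Euclidean norm.

```latex
% (i): the shifted mean stays within the ellipsoid required by Auxiliary Theorem 3
\|\hat{\mu}_t + \alpha_{t1} B_t u_t - \mu_t\|_{\Sigma_t^{-1}}
  \le \frac{1}{9}\left(1 + \|\Sigma_t^{-1/2} B_t\|_2\right)
  \le \frac{1}{9}(1 + 2) = \frac{1}{3} \le \frac{1}{e}

% (ii): the remaining displacement has Euclidean length at most r
\|\alpha_{t2} B_t u_t\|_2
  \le \alpha_{t2}\,\|B_t\|_2\,\|u_t\|_2
  \le r\sqrt{\frac{t\eta\sigma}{2}} \cdot \sqrt{\frac{2}{t\eta\sigma}} \cdot 1
  = r
```

The last step of (i) holds because 1/3 is approximately 0.333 while 1/e is approximately 0.368.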


Based on the above, it is desirable that the exploration parameter $\alpha_t$ is selected by the following equation (16). The equation (16) is an example of calculation formulas of the exploration parameter $\alpha_t$ described above:

$$\alpha_t = \min\left\{\frac{1}{9} + r\sqrt{\frac{t\eta\sigma}{2}},\; d\right\}. \qquad (16)$$  [Math. 42]








Namely, the exploration parameter αt is calculated based on the round t. For example, in the step S205, the policy determination unit 242 may calculate the exploration parameter αt by applying the current round t to the equation (16). Alternatively, when the exploration parameter αt for each round t is calculated in advance outside the optimization apparatus 200 and already acquired as the exploration parameter 211, the policy determination unit 242 may select and read out the exploration parameter αt associated with the round t at that time from the exploration parameters 211 in the storage unit 210.
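Both options described above, computing the exploration parameter per round or reading it from a precomputed table, can be sketched as follows. The function names are hypothetical, and the second argument of the minimum is taken to be d as it appears in equation (16).

```python
import math
from typing import List


def exploration_parameter(t: int, eta: float, sigma: float, r: float, d: int) -> float:
    """alpha_t = min{1/9 + r * sqrt(t * eta * sigma / 2), d}, following equation (16)."""
    return min(1.0 / 9.0 + r * math.sqrt(t * eta * sigma / 2.0), float(d))


def precompute_exploration_parameters(T: int, eta: float, sigma: float,
                                      r: float, d: int) -> List[float]:
    """Table of alpha_1 .. alpha_T, e.g. computed ahead of time and read out per round."""
    return [exploration_parameter(t, eta, sigma, r, d) for t in range(1, T + 1)]
```

With the table, the value for round t is found at index t - 1. Since the uncapped term grows with the square root of t, the entries are non-decreasing until they reach the cap.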


Further, r is less than the shortest distance from the boundary of the feasible region K to the optimal solution. In other words, a ball having a radius r centered on the optimal solution is contained in the feasible region K. Thus, r can be regarded as an index indicating how far inside the feasible region K the optimal solution exists. In addition, it can be said that the policy determination unit 242 applies the round t at that time and r to the equation (16) and calculates the exploration parameter αt in the step S205.


<Effect>


The regret bound in the above-mentioned Non Patent Literature 1 is

$$\tilde{O}(d^{2/3}T^{2/3}),$$  [Math. 43]

but the regret bound in this disclosure is

$$\tilde{O}(d\sqrt{T}).$$  [Math. 44]
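To give a feel for how the two bounds quoted above compare, the following sketch evaluates only their leading terms. It ignores the constants and logarithmic factors hidden by the Õ notation, so it indicates growth rates rather than actual regret values. The function names are hypothetical.

```python
import math


def bound_prior(d: int, T: int) -> float:
    """Leading term d^(2/3) * T^(2/3) of the bound attributed to Non Patent Literature 1."""
    return d ** (2.0 / 3.0) * T ** (2.0 / 3.0)


def bound_disclosure(d: int, T: int) -> float:
    """Leading term d * sqrt(T) of the bound stated for the present disclosure."""
    return d * math.sqrt(T)


def crossover_rounds(d: int) -> int:
    """The two leading terms coincide at T = d^2; beyond that, d * sqrt(T) is smaller."""
    return d * d
```

The ratio of the two terms is T^(1/6) / d^(1/3), so the advantage of the d√T rate grows slowly but without limit as the number of rounds T increases.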


Non Patent Literature 3 does not provide a way to efficiently construct self-concordant barriers for general convex sets. With a $\nu$-self-concordant barrier, there is a gap of

$$\sqrt{\nu}$$  [Math. 45]

between the upper and lower bounds, because the regret bound is

$$\tilde{O}(d\sqrt{\nu T}).$$  [Math. 46]


Because $\nu$ is generally at least $d$ for any compact convex set $K$, compared with the lower bound of

$$O(d\sqrt{T}),$$  [Math. 47]

there is a gap of

$$\Omega(\sqrt{d}).$$  [Math. 48]

Moreover, when no self-concordant barrier with a small $\nu$ is available, for example, when $K$ is expressed by $m$ ($\gg d$) linear inequalities, the gap is even worse.


In contrast, the present disclosure can overcome the above-mentioned problems.


(i) Under mild assumptions, the algorithm of the present disclosure achieves a regret of

$$\tilde{O}(d\sqrt{T}),$$  [Math. 49]

which is minimax optimal up to logarithmic factors. This result is the first tight bound for bandit convex optimization that applies to constrained problems. More precisely, under the assumption that the optimal solution exists in the $r$-interior, the algorithm of the present disclosure achieves a regret bound of

$$\tilde{O}(d\sqrt{T} + d^2/r^2).$$  [Math. 50]


Moreover, even without the interior assumption, the algorithm has a regret bound of

$$\tilde{O}(d^{3/2}\sqrt{T}),$$  [Math. 51]

which is at least as good as those of known algorithms.


(ii) The algorithm of the present disclosure does not require a self-concordant barrier. Instead, it only assumes access to a membership oracle for the feasible region. This means that the algorithm of the present disclosure works well even if K is expressed by an exponentially large number of linear inequalities, or even if no explicit form of K is given.


In addition, by using efficient algorithms for sampling from log-concave distributions, the algorithm of the present disclosure can be executed in polynomial time.


Other Example Embodiments

In the above example embodiment, the hardware configuration has been described, but the present invention is not limited thereto. The present disclosure is also able to achieve any processing by causing a central processing unit (CPU) to execute a computer program.


In the above examples, programs may be stored and provided to a computer by using various types of non-transitory computer readable media. Non-transitory computer readable media include various types of tangible storage media. Examples of the non-transitory computer readable media include a magnetic recording medium (e.g., a flexible disk, a magnetic tape, a hard disk drive), a magneto-optical recording medium (e.g., a magneto-optical disk), a compact disc read only memory (CD-ROM), a CD-R, a CD-R/W, a digital versatile disc (DVD), and a semiconductor memory (e.g., a mask ROM, a programmable ROM (PROM), an erasable PROM (EPROM), a flash ROM, a random access memory (RAM)). The program may also be supplied to the computer by various types of transitory computer readable media. Examples of the transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. The transitory computer readable medium may provide the program to the computer via a wired communication path, such as an electrical wire or an optical fiber, or via a wireless communication path.


Note that the present disclosure is not limited to the above-mentioned example embodiments, and can be modified as appropriate within a range not deviating from the gist. The present disclosure may be implemented by appropriately combining respective example embodiments.


A part or all of the above example embodiments may also be described as the following supplementary notes, but are not limited to the following.


(Supplementary Note A1)


An optimization apparatus including:

    • a setting unit that sets a predetermined non-linear objective function;
    • a policy determination unit that determines a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;
    • a policy execution unit that acquires a reward as an execution result of the determined policy;
    • an update rate determination unit that determines an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and
    • an update unit that updates the non-linear objective function, based on the update rate.


(Supplementary Note A2)


The optimization apparatus according to supplementary note A1, wherein the policy determination unit determines the policy by further using a predetermined exploration parameter.


(Supplementary Note A3)


The optimization apparatus according to supplementary note A1, wherein the policy determination unit

    • calculates the exploration parameter, based on the number of trials of updating in the non-linear objective function, and
    • determines the policy by further using the calculated exploration parameter.


(Supplementary Note A4)


The optimization apparatus according to supplementary note A3, wherein the policy determination unit

    • further calculates the exploration parameter, based on a distance from a boundary of a feasible region.


(Supplementary Note A5)


The optimization apparatus according to any one of supplementary notes A1 to A4, wherein the policy determination unit

    • calculates a mean value and a covariance matrix of a plurality of samples that are generated based on an estimated value of the non-linear objective function, and
    • determines the policy by further using the mean value and the covariance matrix.


(Supplementary Note B1)


An optimization method including:

    • setting a predetermined non-linear objective function;
    • determining a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;
    • acquiring a reward as an execution result of the determined policy;
    • determining an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and
    • updating the non-linear objective function, based on the update rate.


(Supplementary Note C1)


A non-transitory computer readable medium having stored therein an optimization program for causing a computer to execute:

    • setting processing of setting a predetermined non-linear objective function;
    • policy determination processing of determining a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;
    • policy execution processing of acquiring a reward as an execution result of the determined policy;
    • update rate determination processing of determining an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and
    • update processing of updating the non-linear objective function, based on the update rate.


The present invention has been described above with reference to example embodiments (and examples), but the present invention is not limited to the above example embodiments (and examples). Various modifications can be made to the structure and details of the present invention which can be understood by a person skilled in the art within the scope of the present invention.


REFERENCE SIGNS LIST






    • 100 Optimization apparatus
    • 110 Setting unit
    • 120 Policy determination unit
    • 130 Policy execution unit
    • 140 Update rate determination unit
    • 150 Update unit
    • 200 Optimization apparatus
    • 210 Storage unit
    • 211 Exploration parameter
    • 212 Optimization program
    • 220 Memory
    • 230 IF unit
    • 240 Control unit
    • 241 Setting unit
    • 242 Policy determination unit
    • 243 Policy execution unit
    • 244 Update rate determination unit
    • 245 Update unit




Claims
  • 1. An optimization apparatus comprising: at least one memory configured to store instructions; and at least one processor configured to execute the instructions to: set a predetermined non-linear objective function; repeat a plurality of times: determine a policy to be executed in online optimization in a bandit problem in which values thereof of the predetermined non-linear objective function are only partially acquired, based on the predetermined non-linear objective function; execute the determined policy; acquire a reward, as an execution result of the determined policy; determine an update rate of the predetermined non-linear objective function by a multiplicative weight update method, based on the acquired reward and the predetermined non-linear objective function; and update the predetermined non-linear objective function, based on the update rate, such that the predetermined non-linear objective function as updated is used a next time the policy is determined, wherein updating of the predetermined non-linear objective function each of the plurality of times achieves high precision online optimization of the determined policy that is executed.
  • 2. The optimization apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: determine the policy by further using an exploration parameter.
  • 3. The optimization apparatus according to claim 2, wherein the at least one processor is further configured to execute the instructions to: calculate the exploration parameter, based on a number of the plurality of times, and determine the policy by further using the calculated exploration parameter.
  • 4. The optimization apparatus according to claim 3, wherein the at least one processor is further configured to execute the instructions to: further calculate the exploration parameter, based on a distance from a boundary of a feasible region.
  • 5. The optimization apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: calculate a mean value and a covariance matrix of a plurality of samples that are generated based on an estimated value of the non-linear objective function, and determine the policy by further using the mean value and the covariance matrix.
  • 6. An optimization method comprising, by a computer: setting a predetermined non-linear objective function; repeating a plurality of times: determining a policy to be executed in online optimization in a bandit problem in which values thereof of the predetermined non-linear objective function are only partially acquired, based on the predetermined non-linear objective function; executing the determined policy; acquiring a reward, as an execution result of the determined policy; determining an update rate of the predetermined non-linear objective function by a multiplicative weight update method, based on the acquired reward and the predetermined non-linear objective function; and updating the predetermined non-linear objective function, based on the update rate, such that the predetermined non-linear objective function as updated is used a next time the policy is determined, wherein updating of the predetermined non-linear objective function each of the plurality of times achieves high precision online optimization of the determined policy that is executed.
  • 7. A non-transitory computer-readable medium storing an optimization program executable by a computer to perform: setting a predetermined non-linear objective function; repeating a plurality of times: determining a policy to be executed in online optimization in a bandit problem in which values thereof of the predetermined non-linear objective function are only partially acquired, based on the predetermined non-linear objective function; executing the determined policy; acquiring a reward, as an execution result of the determined policy; determining an update rate of the predetermined non-linear objective function by a multiplicative weight update method, based on the acquired reward and the predetermined non-linear objective function; and updating the predetermined non-linear objective function, based on the update rate, such that the predetermined non-linear objective function as updated is used a next time the policy is determined, wherein updating of the predetermined non-linear objective function each of the plurality of times achieves high precision online optimization of the determined policy that is executed.
PCT Information
Filing Document      Filing Date      Country  Kind
PCT/JP2019/039519    10/7/2019        WO
Publishing Document  Publishing Date  Country  Kind
WO2021/070229        4/15/2021        WO       A
US Referenced Citations (5)
Number       Name         Date      Kind
11669768     Theocharous  Jun 2023  B2
20050256778  Boyd         Nov 2005  A1
20150081393  Cohen        Mar 2015  A1
20180220061  Wang         Aug 2018  A1
20180307986  Kabul        Oct 2018  A1
Foreign Referenced Citations (3)
Number       Date      Country
2012-141683  Jul 2012  JP
2016-122241  Jul 2016  JP
2020-009283  Jan 2020  JP
Non-Patent Literature Citations (11)
Entry
Reverdy et al., Parameter Estimation in Softmax Decision-Making Models with Linear Objective Functions, Jan. 2016, IEEE vol. 13 (Year: 2016).
JP Office Action for JP Application No. 2021-550960, dated May 2, 2023 with English Translation.
International Search Report for PCT Application No. PCT/JP2019/039519, dated Dec. 10, 2019.
A. D. Flaxman et al., “Online convex optimization in the bandit setting: gradient descent without a gradient”, Nov. 30, 2004, <URL: http://www.cs.cmu.edu/~mcmahan/soda2005.pdf>, pp. 1-10.
A.Schrijver, “Theory of linear and integer programming”, John Wiley & Sons, 1998, pp. 1-484.
E. Hazan and K. Levy, “Bandit convex optimization: Towards tight bounds”, In Advances in Neural Information Processing Systems, pp. 784-792, 2014.
L. Lovasz and S. Vempala, “The geometry of logconcave functions and sampling algorithms”, Mar. 2005, <URL: https://www.cc.gatech.edu/~vempala/papers/logcon.pdf>, pp. 1-56.
A. Saumard and J. A. Wellner, “Log-concavity and strong log-concavity: a review”, Statistics Surveys, 8:45, 2014, pp. 1-67.
S. Arora, E. Hazan, and S. Kale, “The multiplicative weights update method: a meta algorithm and applications”, Theory of Computing, 8(1):121-164, 2012, pp. 1-31.
A.Belloni, T. Liang, H. Narayanan, and A. Rakhlin, “Escaping the local minima via simulated annealing: Optimization of approximately convex functions”, In Conference on Learning Theory, pp. 240-265, 2015.
Kale, Satyen et al., “Non-Stochastic Bandit Slate Problems”, Neural Information Processing Systems Foundation, Inc., 2010, pp. 1-11.
Related Publications (1)
Number          Date      Country
20230009019 A1  Jan 2023  US