OPTIMIZATION APPARATUS, OPTIMIZATION METHOD AND NON-TRANSITORY COMPUTER-READABLE MEDIUM STORING OPTIMIZATION PROGRAM

Information

  • Publication Number
    20240095610
  • Date Filed
    October 24, 2019
  • Date Published
    March 21, 2024
Abstract
A purpose of the present invention is to achieve highly accurate optimization for a submodular function when a policy is optimized without preparing data for machine learning. An optimization apparatus (100) according to the present invention includes a determination unit (110) that determines one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity, a reward acquisition unit (120) that acquires reward, which is an execution result in the objective function for the determined execution policy, a calculation unit (130) that calculates an update rate of the policy based on the reward, and an update unit (140) that updates the policy based on the update rate.
Description
TECHNICAL FIELD

The present invention relates to an optimization apparatus, an optimization method, and an optimization program, and particularly to an optimization apparatus, an optimization method, and an optimization program for minimizing a submodular function.


BACKGROUND ART

There are known techniques for optimizing policies from past data using machine learning, such as optimization of product prices. Non Patent Literature 1 discloses a technique for online submodular minimization without using machine learning. Non Patent Literature 2 discloses a technique related to price optimization of a large number of products.


CITATION LIST
Non Patent Literature



  • Non Patent Literature 1: E. Hazan and S. Kale, Online submodular minimization, Journal of Machine Learning Research, 13 (October):2903-2922, 2012.

  • Non Patent Literature 2: S. Ito and R. Fujimaki, Large-scale price optimization via network flow, [online], Dec. 5, 2016, [searched on Oct. 18, 2019], Internet <URL: https://papers.nips.cc/paper/6301-large-scale-price-optimization-via-network-flow.pdf>



SUMMARY OF INVENTION
Technical Problem

However, when a policy is optimized without preparing data for machine learning, the accuracy of the technique of Non Patent Literature 1 has been insufficient.


The present disclosure has been made to solve such a problem, and a purpose of the present disclosure is to provide an optimization apparatus, an optimization method, and an optimization program that achieve highly accurate optimization for a submodular function.


Solution to Problem

An optimization apparatus according to a first aspect of the present disclosure includes:

    • a determination means for determining one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity;
    • a reward acquisition means for acquiring a reward, the reward being an execution result in the objective function for the determined execution policy;
    • an update-rate calculation means for calculating an update rate of the policy based on the reward; and
    • an update means for updating the policy based on the update rate.


An optimization method according to a second aspect of the present disclosure is performed by a computer and includes:

    • determining one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity;
    • acquiring a reward, the reward being an execution result in the objective function for the determined execution policy;
    • calculating an update rate of the policy based on the reward; and
    • updating the policy based on the update rate.


An optimization program according to a third aspect of the present disclosure causes a computer to execute:

    • a determination process of determining one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity;
    • a reward acquisition process of acquiring a reward, the reward being an execution result in the objective function for the determined execution policy;
    • an update-rate calculation process of calculating an update rate of the policy based on the reward; and
    • an update process of updating the policy based on the update rate.


Advantageous Effects of Invention

According to the present invention, it is possible to provide an optimization apparatus, an optimization method, and an optimization program that achieve highly accurate optimization for a submodular function when a policy is optimized without preparing data for machine learning.





BRIEF DESCRIPTION OF DRAWINGS


FIG. 1 is a block diagram showing a configuration of an optimization apparatus according to a first example embodiment;



FIG. 2 is a flowchart showing a procedure of an optimization method according to the first example embodiment;



FIG. 3 is a block diagram showing a configuration of an optimization apparatus according to a second example embodiment; and



FIG. 4 is a flowchart showing a procedure of an optimization method according to the second example embodiment.





DESCRIPTION OF EMBODIMENTS

Embodiments in the present disclosure will be described hereinafter in detail with reference to the drawings. The same or corresponding elements are denoted by the same reference signs in the drawings, and duplicated explanations are omitted as necessary for the sake of clarity.


First Example Embodiment


FIG. 1 is a block diagram showing a configuration of an optimization apparatus 100 according to a first example embodiment. The optimization apparatus 100 is an information processing apparatus that performs submodular function minimization (SFM), which includes applications such as image segmentation, learning with structured regularization, and price optimization. Here, a submodular function is a function shown in Expression 1 defined on the subsets of a known finite set [n]:={1, 2, . . . , n} and satisfying the following inequality (1) shown in Expression 2.






$f : 2^{[n]} \to \mathbb{R}$  [Expression 1]





[Expression 2]






f(X)+f(Y)≥f(X∩Y)+f(X∪Y)  (1)


This condition is equivalent to the following Expression 3, which characterizes diminishing marginal returns (diminishing marginal utility).






X⊆Y⊆[n] and i∈[n]\Y,f(X∪{i})−f(X)≥f(Y∪{i})−f(Y)  [Expression 3]


Thus, the submodular function can be said to be a function having diminishing marginal utility.
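As a plain illustration of this property (not part of the embodiments), the following Python sketch checks inequality (1) for a hypothetical set function f(X)=√|X| on the ground set {1, 2, 3}; the function, the ground set, and all names are assumptions chosen only for this check.

from itertools import chain, combinations

# Hypothetical submodular function on {1, 2, 3}: f(X) = sqrt(|X|).
# A concave function of the cardinality is a standard example of
# diminishing marginal returns.
def toy_f(X):
    return len(X) ** 0.5

def subsets(ground):
    s = list(ground)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

ground = {1, 2, 3}
for X in subsets(ground):
    for Y in subsets(ground):
        # Inequality (1): f(X) + f(Y) >= f(X ∩ Y) + f(X ∪ Y)
        assert toy_f(X) + toy_f(Y) >= toy_f(X & Y) + toy_f(X | Y) - 1e-12
print("inequality (1) holds for f(X) = sqrt(|X|) on all subset pairs")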


The optimization apparatus 100 includes a determination unit 110, a reward acquisition unit 120, an update-rate calculation unit 130, and an update unit 140. The determination unit 110 determines one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity. The reward acquisition unit 120 acquires a reward, which is an execution result in the objective function for the determined execution policy. The update-rate calculation unit 130 calculates an update rate of the policy based on the reward. The update unit 140 updates the policy based on the update rate.



FIG. 2 is a flowchart showing a procedure of an optimization method according to the first example embodiment. First, the determination unit 110 determines one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity (S1). The reward acquisition unit 120 acquires a reward, which is an execution result in the objective function for the determined execution policy (S2). The update-rate calculation unit 130 calculates an update rate of the policy based on the reward (S3). The update unit 140 updates the policy based on the update rate (S4).
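To make the flow of FIG. 2 concrete, the following is a minimal Python sketch of the four units as an abstract interface together with one iteration of steps S1 to S4; the class name, the method names, and the simple loop are assumptions for illustration and do not appear in the disclosure.

class OptimizationApparatus:
    """Minimal sketch of the units in FIG. 1: determination (110),
    reward acquisition (120), update-rate calculation (130), update (140)."""

    def __init__(self, policy):
        self.policy = policy  # the predetermined policy (e.g., a vector)

    def determine(self):                                 # S1: determination unit 110
        raise NotImplementedError

    def acquire_reward(self, execution_policies):        # S2: reward acquisition unit 120
        raise NotImplementedError

    def calc_update_rate(self, rewards):                 # S3: update-rate calculation unit 130
        raise NotImplementedError

    def update(self, update_rate):                       # S4: update unit 140
        raise NotImplementedError

    def step(self):
        execution_policies = self.determine()            # S1
        rewards = self.acquire_reward(execution_policies)  # S2
        update_rate = self.calc_update_rate(rewards)      # S3
        self.update(update_rate)                          # S4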


By updating the policy in this manner, it is possible to determine an optimal policy without using past data. Thus, it is possible to achieve highly accurate optimization for a submodular function.


Note that, the optimization apparatus 100 includes a processor, a memory, and a storage device that are not illustrated. In addition, the storage device stores a computer program implementing processes of the optimization method according to the present example embodiment. Then, the processor loads the computer program from the storage device into the memory and executes the computer program. Accordingly, the processor performs the functions of the determination unit 110, the reward acquisition unit 120, the update-rate calculation unit 130, and the update unit 140.


Alternatively, each of the determination unit 110, the reward acquisition unit 120, the update-rate calculation unit 130, and the update unit 140 may be implemented by dedicated hardware. In addition, a part or all of each constituent element of each apparatus may be implemented by a general-purpose or dedicated circuitry, a processor, or the like or a combination thereof. These may be configured by a single chip or a plurality of chips connected via a bus.


A part or all of each constituent element of each apparatus may be implemented by a combination of the circuitry or the like described above and a program. In addition, as a processor, a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA) or the like can be used.


If a part or all of each constituent element of the optimization apparatus 100 is implemented by a plurality of information processing apparatuses, circuitries, or the like, the plurality of information processing apparatuses, circuitries, or the like may be collectively or dispersedly arranged. For example, the information processing apparatuses, circuitries, or the like may be implemented by being connected with each other via a communication network, such as a client server system or a cloud computing system. In addition, the function of the optimization apparatus 100 may be provided in a form of Software as a Service (SaaS).


Second Example Embodiment

Here, in some applications, exact evaluation oracles are not always available, but function values containing noise can be observed. For example, in a price optimization problem, consider selling n types of products. Here, the value of an objective function f(X) corresponds to the expected overall profit, and a variable X⊆[n] corresponds to a set of discounted products.


In this scenario, Non Patent Literature 2 discloses that −f(X) is a submodular function under a certain assumption. Hence, the problem of maximizing the overall profit f(X) is equivalent to minimizing the submodular function −f(X), that is, it is an example of submodular function minimization (SFM).


However, in a practical situation, no explicit expression for f is given, and only the sales of the products can be observed while the prices change. The observed overall profit does not match its expected value f(X) but varies randomly due to the inherent randomness of buying behavior or transient events. This means that the exact value of f(X) is not available. For this reason, existing studies cannot be directly applied to this situation.


In order to cope with such a problem, the present disclosure proposes SFM using a noise-including evaluation oracle that returns a random value whose expected value is f(X). In other words, a noise-including evaluation oracle f{circumflex over ( )} is f{circumflex over ( )}(X)=f(X)+ξ, where ξ is zero-mean noise, which may or may not depend on X.
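The following Python sketch shows one way such a noise-including evaluation oracle could be modeled; the exact function f_expected, the Gaussian noise model, and the names are illustrative assumptions rather than part of the disclosure.

import random

def make_noisy_oracle(f_expected, sigma=0.1, rng=random.Random(0)):
    """Wrap an exact set function f_expected into a noise-including
    evaluation oracle f_hat with E[f_hat(X)] = f_expected(X)."""
    def f_hat(X):
        xi = rng.gauss(0.0, sigma)   # zero-mean noise (it may also depend on X)
        return f_expected(X) + xi
    return f_hat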


It is assumed that access is given to noise-including evaluation oracles f{circumflex over ( )}1, f{circumflex over ( )}2, . . . , f{circumflex over ( )}T, which have a bounded range and are independent across t. In the present disclosure, the case of using a single-point feedback setting and the case of using a more general multiple-point feedback (or k-point feedback) setting are described. In the former setting, for each t∈[T], one query Xt is fed to f{circumflex over ( )}t to obtain the feedback f{circumflex over ( )}t(Xt). In the latter setting, an integer k is given, k queries are fed to f{circumflex over ( )}t for each t, and the k real numbers of the feedback are observed. Note that, even if its expected value is submodular, each f{circumflex over ( )}t does not necessarily have to be submodular.


<Problem Setting>

It is assumed that n is a natural number (positive integer) and that [n]={1, 2, . . . , n} is the finite set of natural numbers at most n. It is assumed that L⊆2^[n] is a distributive lattice. That is, it is assumed that X, Y∈L imply Expression 4.






X∩Y,X∪Y∈L  [Expression 4]


It is assumed that f:L→[−1, 1] is the submodular function to be minimized. In this problem setting, access to the exact value of f is not given; instead, a noise-including evaluation oracle of f shown in Expression 5 is given.





$\{\hat{f}_t\}_{t=1}^{T}$  [Expression 5]


where, f{circumflex over ( )}t is a random function from L to [−1, 1] that satisfies E[f{circumflex over ( )}t(X)]=f(X) for all of t=1, 2, . . . T and X∈L, and E[ ] is an expected value.


The purpose of the present example embodiment is to build an algorithm for solving the following problem. First, the algorithm is given a decision set L and the number of available oracle calls T. For t=1, 2, . . . , T, the algorithm selects Xt∈L and observes f{circumflex over ( )}t(Xt). The selected query Xt can depend on the previous observation values shown in Expression 6.





$\{\hat{f}_j(X_j)\}_{j=1}^{t-1}$  [Expression 6]


After T rounds of observation, the algorithm outputs X{circumflex over ( )}∈L. This problem is referred to as a single-point feedback setting.


In a multiple-point or k-point feedback setting, which is another problem setting, a parameter k≥2 is given in addition to T and L. In the k-point feedback setting, the algorithm selects k queries Xt(1), Xt(2), . . . , and Xt(k)∈L, and then observes the values f{circumflex over ( )}t(Xt(1)), f{circumflex over ( )}t(Xt(2)), . . . , and f{circumflex over ( )}t(Xt(k)) from the evaluation oracle in each round t∈[T]. In both settings, the performance of the algorithm is evaluated in terms of the additive error ET defined as Expression 7.






$E_T = f(\hat{X}) - \min_{X \in L} f(X)$  [Expression 7]


A part of the result depends on the following Assumption 1. Note that, the following is assumed only when explicitly mentioned.


<Assumption 1> Each f{circumflex over ( )}t:L→[−1, 1] is Submodular and k≥2


Here, the first half of Assumption 1 indicates that the output value of the objective function is a submodular function even when it includes noise. This point can be arbitrarily determined by a user. The second half of Assumption 1 indicates that a plurality of policies can be simultaneously executed, that is, it is the k-point feedback.


<Lovász Extension of Submodular Function>

For a vector x=(x1, . . . , xn)^T∈[0, 1]^n and a real number u∈[0, 1], the set of indices i with xi≥u is defined as Hx(u)⊆[n].


That is, this is expressed as Expression 8.






$H_x(u) = \{\, i \in [n] \mid x_i \ge u \,\}$  [Expression 8]


From the distributive lattice L, the set L̃ defined as Expression 9 is the convex hull of L and satisfies Expression 10.






$\tilde{L} = \{\, x \in [0,1]^n \mid H_x(u) \in L \text{ for all } u \in [0,1] \,\}$  [Expression 9]






$\tilde{L} \subseteq [0,1]^n$  [Expression 10]


If the function f:L→R is given, the Lovász extension of f (Expression 11) is defined as the following (2) shown in Expression 12.






$\tilde{f} : \tilde{L} \to \mathbb{R}$  [Expression 11]





[Expression 12]






$\tilde{f}(x) = \int_0^1 f(H_x(u))\, du$  (2)


From this definition, Expression 13 holds for all X∈L, where χX∈{0,1}^n denotes the characteristic vector of X.






$\tilde{f}(\chi_X) = f(X)$  [Expression 13]


That is, Expression 14 is the extension of f.






$\tilde{f}$  [Expression 14]


The following theorem provides a connection between the submodular function and the convex function.


<Theorem 1>

The function f:L→ℝ is submodular if and only if its Lovász extension f̃ (Expression 15) is convex.






$\tilde{f}$  [Expression 15]


For a submodular function f:L→ℝ, the following Expression 16 holds.





$\min_{X \in L} f(X) = \min_{x \in \tilde{L}} \tilde{f}(x)$  [Expression 16]


For x∈[0, 1]^n, a permutation σ:[n]→[n] on [n] such that xσ(1)≥xσ(2)≥ . . . ≥xσ(n) is assumed. In addition, Sσ(i)={σ(j)|j≤i} is defined for any permutation on [n], with Sσ(0)=∅. The Lovász extension defined by (2) can be rewritten as the following (3) shown in Expression 17 and (4) shown in Expression 18.









[Expression 17]

$$\tilde{f}(x) = f(S_\sigma(0)) + \sum_{i=1}^{n} \bigl( f(S_\sigma(i)) - f(S_\sigma(i-1)) \bigr)\, x_{\sigma(i)} \qquad (3)$$

[Expression 18]

$$\tilde{f}(x) = f(S_\sigma(0)) \bigl( 1 - x_{\sigma(1)} \bigr) + \sum_{i=1}^{n-1} f(S_\sigma(i)) \bigl( x_{\sigma(i)} - x_{\sigma(i+1)} \bigr) + f([n])\, x_{\sigma(n)} \qquad (4)$$
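As an illustration of formula (3), the following Python sketch (an assumption for illustration, with a caller-supplied set function f that accepts frozensets of 0-based indices) evaluates the Lovász extension at a point x by sorting the coordinates in descending order.

def lovasz_extension(f, x):
    """Evaluate f~(x) by formula (3):
    f~(x) = f(S_sigma(0)) + sum_i (f(S_sigma(i)) - f(S_sigma(i-1))) * x_sigma(i),
    where sigma sorts the coordinates of x in descending order."""
    n = len(x)
    sigma = sorted(range(n), key=lambda i: -x[i])   # descending order of x
    S_prev = frozenset()                            # S_sigma(0) = empty set
    f_prev = f(S_prev)
    value = f_prev
    for i in sigma:
        S_cur = S_prev | {i}
        f_cur = f(S_cur)
        value += (f_cur - f_prev) * x[i]
        S_prev, f_prev = S_cur, f_cur
    return value

For a 0/1 vector x, that is, the characteristic vector of some X∈L, the sketch returns f(X), which is consistent with Expression 13.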







<Subgradient of Lovász Extension>

From the above two Lovász extension expressions (3) and (4), two alternative ways to express its subgradient are obtained. For a permutation σ on [n] and i∈{0, 1, . . . , n}, Ψσ(i)∈{−1, 0, 1}^n is defined as in the following (5) shown in Expression 19, where χj denotes the j-th standard unit vector, that is, the characteristic vector of {j}.









[Expression 19]

$$\Psi_\sigma(0) = -\chi_{\sigma(1)}, \qquad \Psi_\sigma(n) = \chi_{\sigma(n)}, \qquad \Psi_\sigma(i) = \chi_{\sigma(i)} - \chi_{\sigma(i+1)} \quad (i = 1, 2, \ldots, n-1) \qquad (5)$$







The subgradient of f̃ (Expression 20) at x can be expressed by g(σ), where σ is the above descending-order permutation for x, defined as the following (6) shown in Expression 21 and (7) shown in Expression 22. Here, (6) and (7) are derived from (3) and (4), respectively.









[Expression 20]

$\tilde{f}$

[Expression 21]

$$g(\sigma) := \sum_{i=1}^{n} \bigl( f(S_\sigma(i)) - f(S_\sigma(i-1)) \bigr)\, \chi_{\sigma(i)} \qquad (6)$$

[Expression 22]

$$g(\sigma) = -f(S_\sigma(0))\, \chi_{\sigma(1)} + \sum_{i=1}^{n-1} f(S_\sigma(i)) \bigl( \chi_{\sigma(i)} - \chi_{\sigma(i+1)} \bigr) + f([n])\, \chi_{\sigma(n)} = \sum_{i=0}^{n} f(S_\sigma(i))\, \Psi_\sigma(i) \qquad (7)$$
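Under the same illustrative assumptions as the earlier sketch, the following Python code computes g(σ) of formula (6) for the descending-order permutation of a given x; coordinate σ(i) of the result holds the coefficient f(Sσ(i))−f(Sσ(i−1)).

def lovasz_subgradient(f, x):
    """Compute g(sigma) of formula (6), a subgradient of the Lovász
    extension at x, where sigma sorts x in descending order."""
    n = len(x)
    sigma = sorted(range(n), key=lambda i: -x[i])
    g = [0.0] * n
    S_prev = frozenset()              # S_sigma(0)
    f_prev = f(S_prev)
    for i in sigma:
        S_cur = S_prev | {i}
        f_cur = f(S_cur)
        g[i] = f_cur - f_prev         # coefficient of chi_sigma(i)
        S_prev, f_prev = S_cur, f_cur
    return g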







<Configuration of Optimization Apparatus>


FIG. 3 is a block diagram showing a configuration of an optimization apparatus 200 according to a second example embodiment. The optimization apparatus 200 is an information processing apparatus that is a concrete example of the optimization apparatus 100. The optimization apparatus 200 includes a storage unit 210, a memory 220, an interface (IF) unit 230, and a control unit 240.


The storage unit 210 is a storage device such as a hard disk or a flash memory. The storage unit 210 stores at least an optimization program 211. The optimization program 211 is a computer program implementing an optimization method according to the present example embodiment.


The memory 220 is a volatile storage device such as a random access memory (RAM) and is a storage area for temporarily holding information while the control unit 240 operates. The IF unit 230 is an interface that performs input/output to/from the outside of the optimization apparatus 200. For example, the IF unit 230 accepts input data from other computers or the like via a network (not shown) and outputs the accepted input data to the control unit 240. The IF unit 230 further outputs, in response to an instruction from the control unit 240, data to a transmission destination computer via the network. Alternatively, the IF unit 230 accepts a user's operation via an input device (not shown) such as a keyboard, a mouse, or a touch panel and outputs the accepted operation to the control unit 240. In addition, the IF unit 230 outputs, in response to an instruction from the control unit 240, to a touch panel, a display device, or a printer (not shown).


The control unit 240 is a processor such as a central processing unit (CPU) and controls each unit of the optimization apparatus 200. The control unit 240 loads the optimization program 211 from the storage unit 210 into the memory 220 to execute the optimization program 211. Accordingly, the control unit 240 performs the functions of a determination unit 241, a reward acquisition unit 242, an update-rate calculation unit 243, and an update unit 244. Note that, the determination unit 241, the reward acquisition unit 242, the update-rate calculation unit 243, and the update unit 244 are examples of the determination unit 110, the reward acquisition unit 120, the update-rate calculation unit 130, and the update unit 140, respectively.


The determination unit 241 acquires, from a plurality of element values included in a predetermined policy, a predetermined number of values in descending order and determines a set of the acquired element values as an execution policy. For example, the determination unit 241 permutes a set of policies for a predetermined objective function and determines, depending on whether Assumption 1 is satisfied or not, one or more execution policies from the set of the policies. For example, when Assumption 1 is satisfied, the determination unit 241 determines a subset of two or more execution policies. On the other hand, when Assumption 1 is not satisfied, the determination unit 241 determines a subset of one execution policy. Here, the objective function is, for example, a convex function over continuous-valued policies obtained by the Lovász extension of a submodular function f over discrete-valued policies.


The reward acquisition unit 242 acquires a reward that is an execution result in the objective function for the determined execution policy. For example, the reward acquisition unit 242 gives the set of the determined execution policy to the noise-including evaluation oracle to acquire a reward as an execution result. In particular, the reward acquisition unit 242 acquires two or more rewards using the determined two or more execution policies.


The update-rate calculation unit 243 calculates a policy update rate based on the reward. In particular, the update-rate calculation unit 243 calculates an update rate using the two or more rewards. In addition, it is desirable that the update-rate calculation unit 243 calculates an update rate based on the difference between the two or more rewards.


The update unit 244 updates the policy based on the update rate.


Note that, the details of the above described processes are included in the following description of the flowchart.


<Procedure of Optimization Method>


FIG. 4 is a flowchart showing a procedure of an optimization method according to the second example embodiment. First, as preconditions, it is assumed that the problem size is n≥1, the number of oracle calls is T≥1, the number of feedback values per oracle call is k∈[n+1], and the learning rate is η>0. In addition, it is assumed that the user can arbitrarily change the setting for whether the above Assumption 1 is satisfied or not (the function condition and the value of k). This algorithm is based on the stochastic gradient descent for Expression 23.






$\tilde{f} : \tilde{L} \to [0,1]$  [Expression 23]


First, the control unit 240 sets an initial policy vector as Expression 24 (S201).










$$x_1 = \frac{1}{2} \cdot \mathbf{1} \in \tilde{L} \qquad \text{[Expression 24]}$$







where, the vector “1” has n element values, each of which is 1. That is, the policy vector x1 is an n-dimensional vector in which each element value is ½.


Next, the control unit 240 increments t by 1 from round t=1 to T and repeats the following steps S203 to S213 (S202).


First, the determination unit 241 permutes, using the above permutation function σ:[n]→[n], the element values of the policy vector xt so that xtσ(1)≥ . . . ≥xtσ(n) (S203). That is, the permutation function σ permutes the element values of the policy vector xt in descending order.


Next, the determination unit 241 determines whether the above Assumption 1 is satisfied or not (S204). When it is determined in step S204 that Assumption 1 is not satisfied, for example, when the noise-including evaluation oracle is not a submodular function or when k=1, steps S209 to S212 are performed.


The determination unit 241 randomly selects a subset It of size k from the integers from 0 to n (S209). Here, the subset It is defined by Expression 25 and follows the uniform distribution on the subset family {I⊆{0, 1, . . . , n} : |I|=k}.






$I_t = \{ i_t(j) \}_{j=1}^{k} \subseteq \{0, 1, \ldots, n\}$  [Expression 25]


Next, the determination unit 241 calculates an execution policy based on the following Expression 26 (S210).






$X_t(j) = S_\sigma(i_t(j)) = \{\, \sigma(j') \mid j' \le i_t(j) \,\}$  [Expression 26]


That is, the determination unit 241 selects it(j) element values from the policy vector xt in descending order of the element values and determines them as a set Xt(j) of execution policies, that is, as Sσ(it(j)). Here, Sσ(i) is a function for selecting i element values from the permuted element values of the policy vector xt in descending order to form a set. Note that, this step can also be said to select the random queries shown in Expression 27.





$\{ X_t(j) \}_{j=1}^{k}$  [Expression 27]


Then, the reward acquisition unit 242 observes Expression 28 and acquires it as a reward (S211).






$\hat{f}_t(S_\sigma(i))$  [Expression 28]


That is, the reward acquisition unit 242 calls the noise-including evaluation oracle f{circumflex over ( )} to observe Expression 29.






$\hat{f}_t(X_t(j))$  [Expression 29]


Thereafter, the update-rate calculation unit 243 calculates a policy update rate g{circumflex over ( )}t with the following (8) shown in Expression 30 (S212).









[Expression 30]

$$\hat{g}_t = \frac{n+1}{k} \sum_{j=1}^{k} \hat{f}_t\bigl(X_t(j)\bigr)\, \Psi_\sigma\bigl(i_t(j)\bigr) = \frac{n+1}{k} \sum_{i \in I_t} \hat{f}_t\bigl(S_\sigma(i)\bigr)\, \Psi_\sigma(i) \qquad (8)$$







where, Ψσ(i) is defined by the above (5). Then, g{circumflex over ( )}t is an unbiased estimator of the subgradient and satisfies Expression 31.






$\mathbb{E}\bigl[\|\hat{g}_t\|_2^2\bigr] = O(n^2/k)$  [Expression 31]


<Auxiliary Theorem 1>

It is assumed that g{circumflex over ( )}t is given by the above (8). Then, g{circumflex over ( )}t satisfies the following (9) shown in Expression 32.





[Expression 32]

$$\mathbb{E}[\hat{g}_t \mid x_t] \in \partial \tilde{f}(x_t), \qquad \mathbb{E}\bigl[\|\hat{g}_t\|_2^2\bigr] \le 2(n+1)(n+k)/k \qquad (9)$$
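As a sketch of the general feedback case of steps S209 to S212, the following Python code computes the estimate (8); the function names, the random number generator, and the callable oracle f_hat are assumptions for illustration and build on the hypothetical noisy oracle sketched earlier.

import random

def estimate_gradient_general(f_hat, x, k, rng=random.Random(0)):
    """Estimate (8): sample I_t of size k uniformly from {0, 1, ..., n},
    query f_hat at the prefix sets S_sigma(i), and combine the observed
    rewards with the vectors Psi_sigma(i) of (5)."""
    n = len(x)
    sigma = sorted(range(n), key=lambda i: -x[i])        # descending order (S203)
    prefixes = [frozenset()]                             # S_sigma(0) = empty set
    for i in sigma:
        prefixes.append(prefixes[-1] | {i})              # S_sigma(1), ..., S_sigma(n)

    def psi(i):
        """Psi_sigma(i) of (5), returned as a length-n list."""
        v = [0.0] * n
        if i == 0:
            v[sigma[0]] = -1.0                           # -chi_sigma(1)
        elif i == n:
            v[sigma[n - 1]] = 1.0                        # chi_sigma(n)
        else:
            v[sigma[i - 1]] = 1.0                        # chi_sigma(i)
            v[sigma[i]] = -1.0                           # -chi_sigma(i+1)
        return v

    I_t = rng.sample(range(n + 1), k)                    # random index set (S209)
    g = [0.0] * n
    for i in I_t:
        reward = f_hat(prefixes[i])                      # noisy observation (S211)
        direction = psi(i)
        for j in range(n):
            g[j] += (n + 1) / k * reward * direction[j]  # formula (8) (S212)
    return g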


On the other hand, when it is determined in step S204 that Assumption 1 is satisfied, that is, when the noise-including evaluation oracle is a submodular function and 2≤k(≤n+1), steps S205 to S208 are performed.


First, the determination unit 241 randomly selects a subset Jt of size l from the natural numbers from 1 to n (S205). Here, the size l is the largest integer of k/2 or less, as expressed in the following Expression 33.






l=└k/2┘≥1  [Expression 33]


For example, in the case of k=2 or 3, any one of natural numbers from 1 to n is selected for the subset Jt. In the case of k=4, any two of natural numbers from 1 to n are selected for the subset Jt.


Next, the determination unit 241 calculates execution policies based on the following (S206). That is, the queries shown in Expression 35 are selected so as to satisfy Expression 34 for each i∈Jt.





$\forall i \in J_t : \{ S_\sigma(i), S_\sigma(i-1) \} \subseteq \{ X_t(j) \}_{j=1}^{k}$  [Expression 34]

$\{ X_t(j) \}_{j=1}^{k}$  [Expression 35]


In other words, the determination unit 241 determines a set Sσ(i) of execution policies for which i element values are selected from the policy vector xt in descending order of the element values, and a set Sσ(i−1) of execution policies for which i−1 element values are selected from the policy vector xt in descending order of the element values as execution policies.


Then, the reward acquisition unit 242 observes, for i∈Jt, Expression 36 and Expression 37 and acquires them as rewards (S207).






$\hat{f}_t(S_\sigma(i))$  [Expression 36]






$\hat{f}_t(S_\sigma(i-1))$  [Expression 37]


That is, the reward acquisition unit 242 calls the noise-including evaluation oracle f{circumflex over ( )} to observe them.


Thereafter, the update-rate calculation unit 243 calculates a policy update rate g{circumflex over ( )}t with the following (10) shown in Expression 38 (S208).









[Expression 38]

$$\hat{g}_t = \frac{n}{l} \sum_{i \in J_t} \bigl( \hat{f}_t(S_\sigma(i)) - \hat{f}_t(S_\sigma(i-1)) \bigr)\, \chi_{\sigma(i)} \qquad (10)$$







Then, g{circumflex over ( )}t is an unbiased estimator of the subgradient and satisfies Expression 39.






$\mathbb{E}\bigl[\|\hat{g}_t\|_2^2\bigr] = O(n^2/k)$  [Expression 39]


If f{circumflex over ( )}t (Expression 40) is a submodular function, Expression 41 holds.






$\hat{f}_t$  [Expression 40]






$\mathbb{E}\bigl[\|\hat{g}_t\|_2^2\bigr] = O(n/k)$  [Expression 41]


<Auxiliary Theorem 2>

It is assumed that g{circumflex over ( )}t is given by the above (10). Then, it satisfies the following (11) shown in Expression 42.





[Expression 42]






$\mathbb{E}[\hat{g}_t \mid x_t] \in \partial \tilde{f}(x_t), \qquad \mathbb{E}\bigl[\|\hat{g}_t\|_2^2\bigr] \le 4n^2/l \le 12n^2/k$  (11)


In addition, if f{circumflex over ( )}t is a submodular function, it satisfies the following (12) shown in Expression 43.





[Expression 43]





$\mathbb{E}\bigl[\|\hat{g}_t\|_2^2\bigr] \le 16n/l \le 48n/k$  (12)
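Under Assumption 1, the paired-query estimate (10) of steps S205 to S208 can be sketched as follows (same illustrative assumptions as above); each sampled index i contributes the difference of the two observed rewards along χσ(i).

import random

def estimate_gradient_paired(f_hat, x, k, rng=random.Random(0)):
    """Estimate (10): sample J_t of size l = floor(k/2) from {1, ..., n},
    observe f_hat at S_sigma(i) and S_sigma(i-1), and use the difference
    of the two rewards in coordinate sigma(i)."""
    n = len(x)
    l = k // 2                                           # l = floor(k/2) >= 1 (k >= 2)
    sigma = sorted(range(n), key=lambda i: -x[i])        # descending order (S203)
    prefixes = [frozenset()]
    for i in sigma:
        prefixes.append(prefixes[-1] | {i})

    J_t = rng.sample(range(1, n + 1), l)                 # random index set (S205)
    g = [0.0] * n
    for i in J_t:
        diff = f_hat(prefixes[i]) - f_hat(prefixes[i - 1])   # two rewards (S207)
        g[sigma[i - 1]] += (n / l) * diff                    # formula (10) (S208)
    return g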


After step S208 or S212, the update unit 244 updates xt with the following Expression 44 (S213).






$x_{t+1} = P_{\tilde{L}}(x_t - \eta\, \hat{g}_t)$  [Expression 44]


Here, Expression 45 indicates the Euclidean projection onto L̃ (Expression 46) and is defined as Expression 47.










$$P_{\tilde{L}} : \mathbb{R}^n \to \tilde{L} \qquad \text{[Expression 45]}$$

$$\tilde{L} \qquad \text{[Expression 46]}$$

$$P_{\tilde{L}}(x) \in \operatorname*{arg\,min}_{y \in \tilde{L}} \|y - x\|_2 \qquad \text{[Expression 47]}$$







Then, η>0 is a parameter that can be varied arbitrarily.
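A minimal sketch of the update of Expression 44 follows, assuming for illustration that the feasible region L̃ is the whole box [0, 1]^n so that the Euclidean projection reduces to coordinate-wise clipping; for a general distributive lattice the projection would have to respect the constraints defining L̃.

def project_to_box(x):
    """Euclidean projection onto [0, 1]^n (the illustrative case L~ = [0, 1]^n)."""
    return [min(1.0, max(0.0, xi)) for xi in x]

def update_policy(x, g, eta):
    """x_{t+1} = P_{L~}(x_t - eta * g_hat_t), Expression 44 (S213)."""
    return project_to_box([xi - eta * gi for xi, gi in zip(x, g)])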


Thereafter, the control unit 240 determines whether the round t equals T (S214); when t is less than T, the control unit 240 returns to step S202, sets t=t+1, and performs step S203 and subsequent steps. When t equals T in step S214, the control unit 240 calculates the average x̄ of the policy vectors xt with the following Expression 48.










$$\bar{x} = \frac{1}{T} \sum_{t=1}^{T} x_t \qquad \text{[Expression 48]}$$







Then, the control unit 240 randomly selects u from the uniform distribution on [0, 1] and calculates an optimal solution X{circumflex over ( )} based on the following Expression 49.






$\hat{X} = H_{\bar{x}}(u)$  [Expression 49]


Thereafter, the control unit 240 outputs the optimal solution X{circumflex over ( )}. That is, the control unit 240 outputs the optimal solution X{circumflex over ( )} based on the average of xt (S215). Thereafter, this optimization processing is completed. Here, the following Expression 50 holds from the above (2).






$\mathbb{E}[f(\hat{X})] = \mathbb{E}[\tilde{f}(\bar{x})]$  [Expression 50]
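The averaging of Expression 48 and the threshold rounding of Expression 49 can be sketched as follows; the random threshold u is what makes the expected-value relation of Expression 50 hold. The names are assumptions for illustration.

import random

def average_policy(policy_history):
    """x_bar = (1/T) * sum_t x_t, Expression 48."""
    T = len(policy_history)
    n = len(policy_history[0])
    return [sum(x[j] for x in policy_history) / T for j in range(n)]

def round_policy(x_bar, rng=random.Random(0)):
    """X_hat = H_{x_bar}(u) for u drawn uniformly from [0, 1], Expression 49 (S215)."""
    u = rng.random()
    return frozenset(j for j, xj in enumerate(x_bar) if xj >= u)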


Subsequently, the following theorem is used to analyze the performance of this algorithm.


<Theorem 2>

It is assumed that D⊆ℝ^n is a compact convex set containing 0, that the function shown in Expression 51 is convex, and that x1, . . . , xT are defined by x1=0 and xt+1=PD(xt−ηg{circumflex over ( )}t).






$\tilde{f} : D \to \mathbb{R}$  [Expression 51]


where, E[g{circumflex over ( )}t|xt] is the subgradient of Expression 52 at xt for each t.






$\tilde{f}$  [Expression 52]


Then, Expression 53 satisfies (13) shown in Expression 54.









[Expression 53]

$$\bar{x} := \frac{1}{T} \sum_{t=1}^{T} x_t$$

[Expression 54]

$$\mathbb{E}\Bigl[ \tilde{f}(\bar{x}) - \min_{x^* \in D} \tilde{f}(x^*) \Bigr] \le \frac{1}{T} \left( \frac{\max_{x \in D} \|x\|_2^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \mathbb{E}\bigl[ \|\hat{g}_t\|_2^2 \bigr] \right) \qquad (13)$$







<Effects>

The present disclosure has the following effects.


<Theorem 3>

It is assumed that 1≤k≤n+1. For a problem using k-point feedback, there is an algorithm that returns X{circumflex over ( )} satisfying (14) shown in Expression 55.





[Expression 55]






$\mathbb{E}[E_T] = \mathbb{E}[f(\hat{X})] - \min_{X \in L} f(X) = O\bigl(n^{3/2}/\sqrt{kT}\bigr)$  (14)


If Assumption 1 holds, there is an algorithm that returns X{circumflex over ( )} satisfying (15) shown in Expression 56.





[Expression 56]






$\mathbb{E}[E_T] = \mathbb{E}[f(\hat{X})] - \min_{X \in L} f(X) = O\bigl(n/\sqrt{kT}\bigr)$  (15)


The expected value is taken with respect to the randomness of the oracles f{circumflex over ( )}t and the internal randomness of the algorithm. In both algorithms, the execution time is bounded by O((k·EO+n log n)T), where EO denotes the time required for the evaluation oracle to answer a single query.


If the number of oracle calls T is appropriately selected, an algorithm achieving the error bound (14) can calculate an ε-additive approximate solution (in expectation) for any ε>0. The calculation time for this is shown in Expression 57.









$$O\!\left( \frac{n^3}{\varepsilon^2} \left( EO + \frac{n}{k} \log n \right) \right) \qquad \text{[Expression 57]}$$







In fact, in order to find an ε-additive approximate solution, T is set so as to satisfy Expression 58, which is equivalent to Expression 59.









$$\varepsilon = \Theta\!\left( \frac{n^{3/2}}{\sqrt{kT}} \right) \qquad \text{[Expression 58]}$$

$$T = \Theta\!\left( \frac{n^3}{k\,\varepsilon^2} \right) \qquad \text{[Expression 59]}$$
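For a purely illustrative sense of scale (the numbers are assumptions, not values from the disclosure), substituting n=100, k=4, and ε=0.1 into Expression 59 gives

$$T = \Theta\!\left(\frac{n^3}{k\,\varepsilon^2}\right) = \Theta\!\left(\frac{100^3}{4 \times 0.1^2}\right) = \Theta\bigl(2.5 \times 10^{7}\bigr),$$

that is, on the order of twenty-five million oracle calls for this hypothetical setting.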







The calculation time is bounded as shown in Expression 60.










$$O\bigl( (k \cdot EO + n \log n)\, T \bigr) = O\!\left( \frac{n^3}{\varepsilon^2} \left( EO + \frac{n}{k} \log n \right) \right) \qquad \text{[Expression 60]}$$







Similarly, if Assumption 1 holds and the algorithm that achieves the above (15) is used, an ε-additive approximate solution can be found in the time shown in Expression 61.









$$O\!\left( \frac{n^2}{\varepsilon^2} \left( EO + \frac{n}{k} \log n \right) \right) \qquad \text{[Expression 61]}$$







In Non Patent Literature 1, Expression 62 holds in a single-point feedback setting.












$$O\!\left( \frac{n}{T^{1/3}} \right) \quad \bigl( \text{if } T = \Omega(n^3) \bigr) \qquad \text{[Expression 62]}$$







However, in the present disclosure, Expression 63 holds.












$$O\!\left( \frac{n^{3/2}}{\sqrt{T}} \right) \quad \text{and} \quad \bar{\Omega}\!\left( \frac{n}{\sqrt{T}} \right), \qquad \text{where } \bar{\Omega}(\cdot) := \Omega(\min\{1, \cdot\}) \qquad \text{[Expression 63]}$$







In the present disclosure, assuming that f{circumflex over ( )}t is the submodular function in a k-point feedback setting, Expression 64 holds.










$$O\!\left( \frac{n}{\sqrt{kT}} \right) \quad \text{and} \quad \bar{\Omega}\!\left( \frac{n}{2^k \sqrt{T}} + \sqrt{\frac{n}{T}} \right) \qquad \text{[Expression 64]}$$







Next, examples 2-1 to 2-3 according to the present second example embodiment are described.


Example 2-1

The present example 2-1 describes a case where the optimization apparatus is applied to a price optimization problem when cannibalization can occur depending on how prices are set among competing products. In this case, it is assumed that the above objective function uses, as input, a set of discount policies for each of a plurality of products as the execution policy determined in the above second example embodiment, and outputs a payout amount of each product based on the cannibalization among the plurality of products.


Here, beers are taken as examples of competing products. Then, it is assumed that a policy is a discount on the price of each company's beer at a certain store. For example, when the execution policy X21=[0, 2, 1, . . . ] is set, the first element indicates that the beer price of a company A is the fixed price, the second element indicates that the beer price of a company B is 10% higher than the fixed price, and the third element indicates that the beer price of a company C is 10% discounted from the fixed price. Then, the objective function uses, as input, the execution policy X21 and outputs a result of the sales by applying the execution policy X21 to the beer price of each company. In this case, by applying the optimization method according to the second example embodiment, it is possible to derive the optimal price setting for the beer price of each company at the store without using past history.


Example 2-2

This example 2-2 describes a case where the optimization apparatus is applied to investment behavior of investors or the like. In this case, it is assumed that the execution policies are investment (purchasing, capital increase), sale, or holding of a plurality of financial instruments (stocks or the like) held or to be held by investors. For example, when the execution policy X22=[1, 0, 2, . . . ] is set, the first element indicates additional investment in the shares of a company A, the second element indicates holding the claims of a company B (neither purchasing nor selling), and the third element indicates sale of the shares of a company C. Then, the objective function uses, as input, the execution policy X22 and outputs the result of applying the execution policy X22 to investment behavior in each company's financial instruments. In this case, by applying the optimization method according to the above second example embodiment, it is possible to derive the investors' optimal investment behavior in each stock without using past history.


Example 2-3

This example 2-3 is, as the case of k=2, a price optimization apparatus that dynamically changes product prices in, for example, electronic commerce depending on customers. In this case, first, it is assumed that a price set for a product set at a certain shop of an electronic commerce site is an execution policy X23. Then, the objective function uses, as input, the execution policy X23 and outputs the profit from the sale of each product during a predetermined period of time as a reward. At this time, the price optimization apparatus determines a price set to be proposed to a customer A as an execution policy X23-1 (corresponding to Sσ(i)) and a price set to be proposed to a customer B as an execution policy X23-2 (corresponding to Sσ(i−1)). Then, the price optimization apparatus performs the execution policy X23-1 for the customer A and simultaneously performs the execution policy X23-2 for the customer B. That is, the price optimization apparatus changes and proposes a product price for each customer. As a result, the price optimization apparatus acquires a "reward" from each of the customers A and B. Updating the next execution policies with these rewards corresponds to applying the optimization method according to the above second example embodiment with k=2. Thus, this price optimization apparatus can derive an optimal price setting for each customer in the certain shop of the electronic commerce site.


Other Example Embodiments

The present disclosure is described as a hardware configuration in the above example embodiments but is not limited thereto. The present disclosure can be achieved by a central processing unit (CPU) executing a computer program for any processing.


In the above examples, the program can be stored and provided to a computer using any type of non-transitory computer-readable media. Non-transitory computer-readable media include any type of tangible storage media.


Examples of non-transitory computer-readable media include magnetic storage media (such as flexible disks, magnetic tapes, hard disk drives, etc.), magneto-optical storage media (such as magneto-optical disks), Compact Disc Read Only Memory (CD-ROM), CD-R, CD-R/W, Digital Versatile Disc (DVD), and semiconductor memories (such as mask ROM, Programmable ROM (PROM), Erasable PROM (EPROM), flash ROM, Random Access Memory (RAM), etc.).


The program may be provided to a computer using any type of transitory computer-readable media. Examples of transitory computer-readable media include electric signals, optical signals, and electromagnetic waves. Transitory computer-readable media can provide the program to a computer via a wired communication line (such as electric wires and optical fibers) or a wireless communication line.


Note that, the present disclosure is not limited to the above example embodiments and can be modified without departing from the gist thereof. In addition, the present disclosure may be implemented by combining the example embodiments as appropriate.


The whole or part of the above example embodiments can be described as, but not limited to, the following Supplementary notes.


(Supplementary Note A1)

An optimization apparatus comprising:

    • a determination means for determining one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity;
    • a reward acquisition means for acquiring a reward, the reward being an execution result in the objective function for the determined execution policy;
    • an update-rate calculation means for calculating an update rate of the policy based on the reward; and
    • an update means for updating the policy based on the update rate.


(Supplementary Note A2)

The optimization apparatus according to Supplementary note A1, wherein the objective function uses, as input, a set of discount policies for each of a plurality of products as the execution policy, and outputs a payout amount for each product based on cannibalization among the plurality of products.


(Supplementary Note A3)

The optimization apparatus according to Supplementary note A1 or A2, wherein the determination means acquires a predetermined number of element values in descending order from a plurality of element values contained in the predetermined policy, and determines a set of acquired element values as the execution policy.


(Supplementary Note A4)

The optimization apparatus according to any one of Supplementary notes A1 to A3, wherein

    • the determination means determines two or more execution policies,
    • the reward acquisition means acquires two or more of the rewards using the determined two or more execution policies, and
    • the update-rate calculation means calculates the update rate using the two or more of the rewards.


(Supplementary Note A5)

The optimization apparatus according to Supplementary note A4, wherein the update-rate calculation means calculates the update rate based on a difference between the two or more of the rewards.


(Supplementary Note B1)

An optimization method performed by a computer, the method comprising:

    • determining one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity;
    • acquiring a reward, the reward being an execution result in the objective function for the determined execution policy;
    • calculating an update rate of the policy based on the reward; and
    • updating the policy based on the update rate.


(Supplementary Note C1)

A non-transitory computer-readable medium storing an optimization program causing a computer to execute:

    • a determination process of determining one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity;
    • a reward acquisition process of acquiring a reward, the reward being an execution result in the objective function for the determined execution policy;
    • an update-rate calculation process of calculating an update rate of the policy based on the reward; and
    • an update process of updating the policy based on the update rate.


The present invention has been described with reference to the example embodiments (and examples), but is not limited to the example embodiments (and examples). Various modifications that can be understood by those skilled in the art can be made to the configurations and the details of the present invention without departing from the scope of the invention.


REFERENCE SIGNS LIST






    • 100 Optimization apparatus


    • 110 Determination unit


    • 120 Reward acquisition unit


    • 130 Update-rate calculation unit


    • 140 Update unit


    • 200 Optimization apparatus


    • 210 Storage unit


    • 211 Optimization program


    • 220 Memory


    • 230 IF unit


    • 240 Control unit


    • 241 Determination unit


    • 242 Reward acquisition unit


    • 243 Update-rate calculation unit


    • 244 Update unit




Claims
  • 1. An optimization apparatus comprising: at least one memory configured to store instructions; andat least one processor configured to execute the instructions to:determine one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity;acquire reward, the reward being an execution result in the objective function for the determined one or more execution policies;calculate an update rate of the policy based on the reward; andupdate the policy based on the update rate.
  • 2. The optimization apparatus according to claim 1, wherein the objective function uses, as input, a set of discount policies for each of a plurality of products as the execution policy, and outputs a payout amount for each product based on cannibalization among the plurality of products.
  • 3. The optimization apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: acquire a predetermined number of element values in descending order from a plurality of element values contained in the predetermined policy, and determine a set of acquired element values as the execution policy.
  • 4. The optimization apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: determine two or more execution policies,acquire two or more of the rewards using the determined two or more execution policies, andcalculate the update rate using the two or more of the rewards.
  • 5. The optimization apparatus according to claim 4, wherein the at least one processor is further configured to execute the instructions to: calculate the update rate based on a difference between the two or more of the rewards.
  • 6. An optimization method performed by a computer, the method comprising: determining one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity;acquiring a reward, the reward being an execution result in the objective function for the determined execution policy;calculating an update rate of the policy based on the reward; andupdating the policy based on the update rate.
  • 7. A non-transitory computer-readable medium storing an optimization program causing a computer to execute: a determination process of determining one or more execution policies from a predetermined policy in an objective function having diminishing marginal utility and convexity;a reward acquisition process of acquiring a reward, the reward being an execution result in the objective function for the determined execution policy;an update-rate calculation process of calculating an update rate of the policy based on the reward; andan update process of updating the policy based on the update rate.
PCT Information
Filing Document Filing Date Country Kind
PCT/JP2019/041685 10/24/2019 WO