The present disclosure relates to an information processing apparatus, an information processing method, and an information processing program.
Various proposals have been made concerning a learned model for automatically generating contents (hereinafter also referred to as “sequence”) that provide a sequence of information such as music. For example, Patent Literature 1 discloses a method of selectively learning a feature value designated by a user such that a sequence is generated in a mode desired by the user.
In some cases, it is desired to generate a sequence in which only a part is generated anew and the remainder is maintained. This point is not specifically studied in Patent Literature 1.
An aspect of the present disclosure provides an information processing apparatus, an information processing method, and an information processing program capable of generating a sequence in which only a part is generated anew and the remainder is maintained.
An information processing apparatus according to one aspect of the present disclosure includes: control means; data input means for inputting sequence data; a machine learning model that generates new sequence data based on the sequence data input by the data input means; and sequence data selecting means for, when the new sequence data is generated by the machine learning model, selecting target sequence data for changing the sequence data and/or context sequence data for not changing the sequence data, wherein the control means: (i) generates new target sequence data that interpolates at least two sequence data already generated by the machine learning model; or (ii) generates new different sequence data for the sequence data already generated by the machine learning model.
An information processing apparatus according to one aspect of the present disclosure includes a generation unit that generates a sequence including a determined context sequence and a new target sequence using input information and a learned model, the input information being information concerning a sequence in which a part is configured by a target sequence and a remainder is configured by a context sequence and that provides a series of information, wherein when data corresponding to the input information is input, the learned model outputs data corresponding to the new target sequence.
An information processing apparatus according to one aspect of the present disclosure includes: a generation unit that generates a sequence including a determined context sequence and a new target sequence using input information and a learned model, the input information being information concerning a sequence in which a part is configured by a target sequence and a remainder is configured by a context sequence and that provides a series of information; and a user interface that receives the input information and presents a generation result of the generation unit, wherein when data corresponding to the input information is input, the learned model outputs data corresponding to the new target sequence.
An information processing method according to one aspect of the present disclosure includes generating a sequence including a determined context sequence and a new target sequence using input information and a learned model, the input information being information concerning a sequence in which a part is configured by a target sequence and a remainder is configured by a context sequence and that gives a series of information, wherein when data corresponding to the input information is input, the learned model outputs data corresponding to the new target sequence.
An information processing program according to one aspect of the present disclosure causes a computer to execute generating a sequence including a determined context sequence and a new target sequence using input information and a learned model, the input information being information concerning a sequence in which a part is configured by a target sequence and a remainder is configured by a context sequence and that gives a series of information, wherein when data corresponding to the input information is input, the learned model outputs data corresponding to the new target sequence.
An embodiment of the present disclosure is explained in detail below with reference to the drawings. Note that, in the embodiment explained below, redundant explanation is omitted by denoting the same elements with the same reference numerals and signs.
The present disclosure is explained in the order of the items described below.
Processing target information of an information processing apparatus according to an embodiment is a sequence (sequence data) that provides a series of information. Examples of the sequence include music (a music sequence, audio, and the like) and language (document, poetry). In the following explanation, a case in which the sequence is a music sequence is mainly explained as an example.
The sequence x is divided into a plurality of sequences by the operation relating to the item “range designation”. For example, a part of the visualized and displayed sequence x is selected as a range, dividing the sequence into the selected portion and the other portion. A part of the divided sequence x is referred to as the target sequence xT (illustrated by hatching) and the remainder is referred to as the context sequence xc. The target sequence xT is a portion that is requested to be changed. The context sequence xc is a portion that is requested not to be changed (to be maintained). Since the context sequence xc is not changed, it can be said that the context sequence xc is a determined context sequence xc. By the operation relating to the item “range designation”, position information (equivalent to position information R in
A sequence is generated by operation relating to an item “search”. As explained in detail below, when “normal generation” is designated, a sequence is generated based on the context sequence xc input by the operation related to “sequence selection” explained above and the position information input by the operation relating to “range designation”.
Referring to
By the operation relating to the item “search”, a further sequence is generated based on the sequence A (as a starting point). As explained in detail below, when “variation generation” is designated, a sequence including a target sequence different from the target sequence xT of the generated sequence is generated. In operation related to “feature designation”, a feature of the sequence is designated. In this example, any position (feature) in a latent space FS specifying a feature value of the sequence is designated and a sequence having the feature (a feature value corresponding to the designated position) is generated. This sequence is also a sequence including a target sequence different from the target sequence xTA of the sequence A. For example, via these kinds of operation, a plurality of generated sequences each including a different new target sequence are obtained.
Referring to
In the item “search”, a further sequence is generated based on the sequence A and the like. As explained in detail below, when “interpolation generation” is designated, a sequence having an intermediate feature between features of the designated sequences (in this example, the sequence A and the sequence B) is generated. The “variation generation” and the “feature designation” are as explained above with reference to
Note that operation of various forms may be presented by the user interface other than the operation screens illustrated in
The user interface 10 has a function of an input unit (a reception unit) that receives information according to user operation. It can also be said that the user interface 10 has a function of data input means for inputting sequence data. For example, as explained above with reference to
The input information includes information concerning a sequence. The information concerning the sequence is information concerning a sequence including a determined context sequence xc. Examples of such input information are the information concerning the sequence x explained above with reference to
The input information may include information for designating at least one sequence among a plurality of generated sequences. An example of such input information is information for designating the sequence A and the like explained above with reference to
The input information may include information for designating a feature of the sequence. An example of such input information is information for designating a position (a feature of the sequence) in the latent space FS described above with reference to
The user interface 10 has a function of an output unit (a presentation unit) that presents information to the user. The user interface 10 outputs a generation result of the generation unit 30 explained below. For example, the sequence A and the like are presented (screen display, sound output, or the like) in the form explained above with reference to
The storage unit 20 stores various kinds of information used in the information processing apparatus 1. As an example of the information stored in the storage unit 20, a learned model 21 and an information processing program 22 are illustrated.
The learned model 21 is a learned model generated (trained) using learning data so as to output data corresponding to the new target sequence xT when data corresponding to the input information explained above is input. It can also be said that the learned model 21 is a machine learning model for generating new sequence data based on the input sequence data. The generation unit 30 generates, from the input information, data corresponding to the input information and inputs the data to the learned model 21. The generation unit 30 then generates, from the data output by the learned model 21, a sequence corresponding to that data. The input/output data of the learned model 21 includes, for example, a sequence of tokens (a token sequence). In this case, the data input to the learned model 21 includes tokens of the context sequence xc, and the data output by the learned model 21 includes tokens of the new target sequence xT. The token is explained with reference to
On the lower side of the figure, a token sequence corresponding to the music sequence is illustrated. In this example, the token indicates either a pitch value of sound or a duration of the sound. In the token sequence, a first token and a second token are arranged in time order. The first token is a token indicating generation and stop of each kind of sound included in the sequence. The second token is a token indicating a period in which a state indicated by the first token corresponding to the second token is maintained. A portion represented by angle brackets < > corresponds to one token.
For example, a token <ON, W, 60> is a token (the first token) indicating that generation of sound at a pitch value 60 of a sound source W (for example, indicating a type of a musical instrument) starts at time 0. The following token <SHIFT, 1> is a token (the corresponding second token) indicating that a state (the sound source W, the pitch value 60) indicated by the corresponding first token is maintained for one unit time. That is, SHIFT means that only time moves (only time passes) while the state indicated by the immediately preceding token is maintained. Other tokens concerning ON and SHIFT are explained in the same manner. A token <OFF, W, 60> is a token (the first token) indicating that the generation of the sound at the pitch value 60 of the sound source W ends. Other tokens relating to OFF are explained in the same manner. Note that, in this example, when a plurality of kinds of sound are present at the same time, the tokens are arranged in order from the token corresponding to the lowest sound. Determining the order in this manner makes the learned model 21 easier to train.
Note that the above is an example of the token of the sequence in the case in which the sequence is music. When the sequence is a language, the token is a word or the like.
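As a concrete illustration of the token scheme above, the following Python sketch converts a short monophonic note list into ON/SHIFT/OFF tokens. The function name, the tuple-based note format, and the exact token spelling are illustrative assumptions, not the patent's actual encoding; simultaneous events are emitted lowest pitch first, as the text notes.

```python
def notes_to_tokens(notes, source="W"):
    """Encode notes as ON/SHIFT/OFF tokens (illustrative format).

    notes: list of (start, duration, pitch) in unit-time steps.
    """
    # Collect events as (time, order, kind, pitch); OFF sorts before ON
    # at the same time, and simultaneous events sort lowest pitch first.
    events = []
    for start, dur, pitch in notes:
        events.append((start, 1, "ON", pitch))
        events.append((start + dur, 0, "OFF", pitch))
    events.sort(key=lambda e: (e[0], e[1], e[3]))

    tokens, now = [], 0
    for time, _, kind, pitch in events:
        if time > now:
            # SHIFT: only time passes while the preceding state is held.
            tokens.append(f"<SHIFT,{time - now}>")
            now = time
        tokens.append(f"<{kind},{source},{pitch}>")
    return tokens

print(notes_to_tokens([(0, 1, 60), (1, 2, 64)]))
```

For the two notes above, this yields ON/SHIFT/OFF tokens in time order, with the SHIFT tokens carrying the elapsed unit times between state changes.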
The encoder model 211 gives a feature value z. The feature value z may be a vector indicating a position (a point) in the latent space FS. It can be said that the position in the latent space FS indicates a feature of a sequence. The latent space FS is a multidimensional space and is also referred to as a latent feature space or the like. In the embodiment, it can also be said that the latent space FS is a context latent space learned under a condition (with a context condition) that a determined context sequence xc is maintained. The latent space FS in
The sequence x input to the encoder model 211 is illustrated as tokens s1, . . . , sk−1, sk, . . . , sj, sj+1, . . . , and sL. The subscripts indicate the order of the tokens in the sequence. Among the subscripts, the variable j and the variable k give the position information R. The first to k−1-th tokens s1 to sk−1 and the j+1-th to L-th tokens sj+1 to sL are specified as positions of the context sequence xc. In other words, the k-th to j-th tokens sk to sj are specified as positions of the new target sequence xT to be generated later.
In the encoder model 211, only the tokens of the context sequence xc among the tokens, the positions of which are specified as explained above, are input to the RNN. The RNN outputs the feature value z of the input tokens of the context sequence xc. As explained above, since the encoder model 211 outputs the feature value z when the sequence x and the position information R are input, the encoder model 211 is expressed and illustrated as “q(z|x,R)”.
Like the encoder model 211, the prior model 212 also gives the feature value z. The context sequence xc and the position information R are input to the prior model 212.
The context sequence xc is illustrated as tokens s1, . . . , sk−1 and tokens sj+1, . . . , sL. The remaining tokens are given as a predetermined token M. When there are a plurality of remaining tokens, all of them may be given as the same token M. It can also be said that the portion of the sequence x other than the context sequence xc (the portion of the new target sequence xT to be generated later) is masked by the token M. The token M may be chosen so as to give a feature value different from the feature values z of all tokens that can be input as tokens of the context sequence xc.
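The masking of the target portion by the token M can be sketched as follows. This is a minimal illustration; the mask symbol “<M>” and the plain-list representation are assumptions, not the patent's data format.

```python
def mask_target(tokens, k, j):
    """Replace the target span with the predetermined token M.

    tokens: the full sequence s1..sL; k and j are 1-based positions.
    Positions outside k..j are the context sequence xc and are kept;
    every masked position receives the same token M.
    """
    MASK = "<M>"  # illustrative mask symbol
    return [MASK if k <= i <= j else t for i, t in enumerate(tokens, start=1)]

print(mask_target(["s1", "s2", "s3", "s4", "s5"], 2, 3))  # ['s1', '<M>', '<M>', 's4', 's5']
```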
The position information R is as explained above. In this example, the first to k−1-th tokens s1 to sk−1 and the j+1-th to L-th tokens sj+1 to sL are specified as positions of the context sequence xc.
In the prior model 212, only the token M among the tokens, the positions of which are specified as explained above, is input to the RNN. The RNN outputs the feature value z for the input token M. As explained above, since the prior model 212 outputs the feature value z when the context sequence xc and the position information R are input, the prior model 212 is expressed and illustrated as “p(z|xc,R)”.
The decoder model 213 generates the tokens of the new target sequence xT based on the feature value z and the tokens of the context sequence xc. Specifically, of the context sequence xc and the target sequence xT, the decoder model 213 reconfigures only the tokens of the target sequence xT. The reconfigured tokens of the target sequence xT and the original, determined tokens of the context sequence xc are combined by, for example, the generation unit 30, and a sequence including the context sequence xc and the new target sequence xT is generated. As explained above, when the feature value z, the context sequence xc, and the position information R are input, the decoder model 213 outputs a sequence in which only the target sequence xT is reconfigured. Therefore, the decoder model 213 is expressed and illustrated as “p(xT|z,xc,R)”.
Note that, in the example illustrated in
The encoder model 211, the prior model 212, and the decoder model 213 explained above are trained to minimize a loss function. In this example, a loss function Lrec and a loss function Lpri are used. The parameters of the encoder model 211, the prior model 212, and the decoder model 213 are learned to minimize the total (for example, the sum) of the loss function Lrec and the loss function Lpri. The loss function Lrec is the error (a reconfiguration error) at the time when the decoder model 213 reconfigures a target sequence using the feature value z output by the prior model 212. The loss function Lpri is the difference (a prior error) between the distributions of the encoder model 211 and the prior model 212. An example of the prior error is the Kullback-Leibler (KL) divergence.
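A minimal sketch of this objective follows, assuming diagonal-Gaussian outputs for q(z|x,R) and p(z|xc,R) and a token-level negative log-likelihood for the reconfiguration error. The function names and shapes are illustrative assumptions, not the patent's implementation.

```python
import math

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q,var_q) || N(mu_p,var_p) ) summed over
    dimensions: the Lpri term between q(z|x,R) and p(z|xc,R)."""
    return sum(
        0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

def reconstruction_loss(probs, target_ids):
    """Lrec as the negative log-likelihood of the true target tokens
    under the decoder's per-step output distributions."""
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids))

# Parameters are updated to minimize the total Lrec + Lpri.
l_pri = gaussian_kl([0.0], [1.0], [0.0], [1.0])   # identical distributions -> 0.0
l_rec = reconstruction_loss([[0.1, 0.9]], [1])    # -log 0.9
print(l_pri, round(l_rec, 4))
```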
In step S1, a mini-batch of a sequence is acquired from the learning data. For example, any predetermined number (sixty-four or the like) of sequences x are acquired (sampled) from the learning data.
In step S2, position information is set. For example, the position information R explained above with reference to
In step S3, parameters are updated using the loss function. For example, as explained above with reference to
The learning in step S1 to step S3 explained above is repeatedly executed a predetermined number of times. That is, as depicted in step S4, when the number of times of learning is less than the predetermined number of times (step S4: YES), the processing is returned to step S1. When the number of times of learning reaches the predetermined number of times (step S4: NO), the processing of the flowchart ends.
For example, the learned model 21 is generated as explained above. Note that the parameter update may be performed by setting different position information for the same mini-batch. In that case, the processing in step S2 and step S3 may be repeatedly executed as many times as the number of patterns of the position information R that is set.
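The flow of steps S1 to S4, including the optional reuse of a mini-batch with several position-information patterns, can be sketched as follows. Here update_parameters is a hypothetical stand-in for one optimization step on Lrec + Lpri; all names are assumptions.

```python
import random

def update_parameters(batch, R):
    """Placeholder for one optimization step minimizing Lrec + Lpri
    (a real step would backpropagate through the three models)."""
    update_parameters.calls += 1

update_parameters.calls = 0

def train(learning_data, steps, patterns_per_batch=1, batch_size=64):
    """Sketch of steps S1 to S4: sample a mini-batch (S1), set position
    information R (S2), update parameters (S3), repeat a fixed number
    of times (S4)."""
    for _ in range(steps):                                   # S4: fixed repeat count
        batch = random.sample(learning_data,
                              min(batch_size, len(learning_data)))  # S1
        seq_len = len(batch[0])
        for _ in range(patterns_per_batch):      # reuse the batch with new R
            k = random.randint(1, seq_len)       # S2: target starts at sk
            j = random.randint(k, seq_len)       # S2: target ends at sj
            update_parameters(batch, (k, j))     # S3: minimize Lrec + Lpri

data = [list("abcdef") for _ in range(8)]
train(data, steps=3, patterns_per_batch=2)
print(update_parameters.calls)  # 6
```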
Returning to
The generation unit 30 generates a sequence including the determined context sequence xc and the new target sequence xT using the input information input to the user interface 10 and the learned model 21. The sequence to be generated is the generated sequence (the sequence A or the like) explained above with reference to
In step S11, a feature value is acquired (sampled) using the input context sequence, the position information, and the prior model. For example, the user interface 10 receives the context sequence xc and the position information R as input information according to the operation relating to the items “sequence selection” and “range designation” explained above with reference to
In step S12, a target sequence is generated using the context sequence, the feature value, and the decoder. For example, the generation unit 30 inputs, using the learned model 21, the context sequence xc used in the preceding step S11 and the acquired feature value z to the decoder model 213 as explained above with reference to
In step S13, a sequence including the context sequence and the target sequence is generated. For example, the generation unit 30 combines the context sequence xc used in the preceding step S12 and the generated new target sequence xT and generates a sequence including the context sequence xc and the new target sequence xT.
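Steps S11 to S13 can be sketched end to end as follows. The sample_prior and decode callables are hypothetical stand-ins for the learned model's p(z|xc,R) and p(xT|z,xc,R) components, not the patent's implementation.

```python
def normal_generation(context, k, j, sample_prior, decode):
    """Sketch of steps S11 to S13.

    context: full-length token list; 1-based positions k..j are the
    target span to be regenerated, the rest is kept unchanged.
    """
    R = (k, j)
    z = sample_prior(context, R)       # S11: acquire (sample) feature value z
    target = decode(z, context, R)     # S12: generate new target tokens
    # S13: combine the unchanged context with the new target sequence.
    return context[:k - 1] + target + context[j:]

# Toy stand-ins so the sketch runs end to end.
toy_prior = lambda ctx, R: 0.5
toy_decoder = lambda z, ctx, R: ["t"] * (R[1] - R[0] + 1)
print(normal_generation(["s1", "s2", "s3", "s4"], 2, 3, toy_prior, toy_decoder))
```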
In step S21, a feature value different from the feature values of a designated plurality of sequences is specified. For example, the user interface 10 receives, as input information, the information for designating the sequence A and the sequence B explained above with reference to
The feature value zAB may be specified by weighting the feature value zA and the feature value zB. For example, the feature value zAB may be calculated as zAB=(1−α)zA+αzB. Here, α indicates the ratio (a blend ratio) of the feature value zB in the feature value zAB and (1−α) indicates the ratio of the feature value zA. For example, in the case of α=0.25, a feature value obtained by combining (blending) the feature value zA and the feature value zB at 0.75:0.25 is specified as the feature value zAB. For example, the user interface 10 may provide display or the like with which the user can designate α.
In step S22, a target sequence is generated using the specified feature value, the context sequence, and the decoder. For example, the generation unit 30 inputs, using the learned model 21, the feature value zAB specified in the preceding step S21 to the decoder model 213. The decoder model 213 generates a target sequence xTAB corresponding to the feature value zAB. The target sequence xTAB and the context sequence xc obtained in this way are combined and a new sequence AB is generated.
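The interpolation of feature values described in steps S21 and S22 reduces to an elementwise blend of two latent vectors, sketched below; the list-of-floats latent representation is an assumption for illustration.

```python
def interpolate(z_a, z_b, alpha):
    """Blend two feature values elementwise: zAB = (1 - alpha)*zA + alpha*zB.

    alpha is the blend ratio of zB; alpha=0.25 combines zA and zB at 0.75:0.25.
    """
    return [(1 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]

print(interpolate([0.0, 4.0], [4.0, 0.0], 0.25))  # [1.0, 3.0]
```

The blended vector would then be passed to the decoder exactly as in step S22 to generate the target sequence xTAB.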
In step S31, a feature value near the feature value of the designated sequence is specified. For example, the user interface 10 receives, as input information, the information for designating the sequence A and the information for designating the “variation generation” in the example in
In step S32, a target sequence is generated using the specified feature value, the context sequence, and the decoder. For example, the generation unit 30 inputs, using the learned model 21, the feature value zA′ specified in the preceding step S31 to the decoder model 213. The decoder model 213 generates a target sequence xTA′ corresponding to the feature value zA′. The target sequence xTA′ obtained in this way and the context sequence xc are combined and a new sequence A′ is generated. Note that a plurality of different feature values may be specified in the preceding step S31. In this case, as many new target sequences and new sequences as the number of feature values (the number of variations) are generated. For example, the user interface 10 may provide display or the like with which the user can designate the number of variations.
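One way to specify feature values near a designated feature value, as in step S31, is to add small noise to the latent vector. The sketch below assumes Gaussian noise with an arbitrary scale; the patent only states that nearby feature values are specified, so the distribution and scale are assumptions.

```python
import random

def variations(z, n, scale=0.1, seed=0):
    """Produce n feature values near z by adding Gaussian noise to each
    dimension, one noisy copy per requested variation."""
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, scale) for v in z] for _ in range(n)]

vs = variations([0.0, 1.0], n=3)
print(len(vs), all(len(v) == 2 for v in vs))  # 3 True
```

Each noisy vector would then be decoded as in step S32, yielding one variation sequence per feature value.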
Note that the sequence on which the variation generation is based and the generated sequence and the sequence on which the interpolation generation is based and the generated sequence sometimes overlap. For example, as explained above, the sequence B is generated by the interpolation generation from the sequence A and the sequence C. The sequence A and the sequence C can be generated by the variation generation from the sequence B.
Besides the normal generation, the interpolation generation, and the variation generation explained above, various generation methods may be used. As a fourth generation method, the generation unit 30 may generate a sequence having a designated feature. For example, as explained above with reference to
By combining the various generation methods explained above, it is possible to search for a desired sequence. This is explained with reference to
A further sequence search is performed based on the sequence A and the like (as a starting point). For example, as illustrated in an upper part of the figure, the interpolation generation may be performed. In this example, a sequence AB (illustrated by a white circle) having an intermediate feature between features of the sequence A and the sequence B and a sequence BC (illustrated by a white circle) having an intermediate feature between features of the sequence B and the sequence C are generated. From the generated sequence AB, the generated sequence BC, and the like, a further sequence may be generated by interpolation generation, variation generation, feature designation, and the like.
Alternatively, variation generation may be performed as illustrated in a middle part of the figure. In this example, a sequence A′, a sequence A″, and a sequence A′″ (all of which are illustrated by white circles) having a feature obtained by adding noise to a feature of the sequence A are generated. From the generated sequence A′, the generated sequence A″, the generated sequence A′″, and the like, a further sequence may be generated by the interpolation generation, the variation generation, the feature designation, and the like.
Alternatively, as illustrated in a lower part of the figure, the feature designation may be performed. In this example, a sequence D, a sequence E, and a sequence F (all of which are illustrated by white circles) having designated features are generated. From the generated sequence D, the generated sequence E, the generated sequence F, and the like, a further sequence may be generated by the interpolation generation, the variation generation, the feature designation, and the like.
For example, as explained above, the user U can repeat the generation of a sequence until obtaining a desired sequence.
As explained above, the information processing apparatus 1 makes it possible to generate a sequence by combining the various generation methods, providing sequence generation with excellent operability. The user U can narrow down sequences to obtain a desired target sequence. For example, the user U can generate a sequence A to a sequence G including different target sequences and further generate, with the interpolation generation, a sequence obtained by blending favorite sequences B and F among the sequences A to G. The user U can also improve a favorite target sequence while finely correcting it. For example, the user U can generate, with the variation generation, sequences similar to but slightly different from the sequence A (for example, the sequence B to the sequence E) and then blend, with the interpolation generation, the generated sequences close to the intended image (for example, the sequence C and the sequence E) to generate a further sequence.
The CPU 1100 operates based on programs stored in the ROM 1300 or the HDD 1400 and controls the units. For example, the CPU 1100 develops the programs stored in the ROM 1300 or the HDD 1400 in the RAM 1200 and executes processing corresponding to various programs.
The ROM 1300 stores a boot program such as a BIOS (Basic Input Output System) executed by the CPU 1100 at a start time of the computer 1000, a program depending on hardware of the computer 1000, and the like.
The HDD 1400 is a computer-readable recording medium that non-transiently records a program to be executed by the CPU 1100, data to be used by such a program, and the like. Specifically, the HDD 1400 is a recording medium that records an information processing program according to the present disclosure, which is an example of program data 1450.
The communication interface 1500 is an interface for the computer 1000 to be connected to an external network 1550 (for example, the Internet). For example, the CPU 1100 receives data from other equipment and transmits data generated by the CPU 1100 to the other equipment via the communication interface 1500.
The input/output interface 1600 is an interface for connecting an input/output device 1650 and the computer 1000. For example, the CPU 1100 receives data from an input device such as a keyboard or mouse via the input/output interface 1600. Further, the CPU 1100 transmits data to an output device such as a display, a speaker, or a printer via the input/output interface 1600. The input/output interface 1600 may function as a media interface that reads a program or the like recorded in a predetermined recording medium (a medium). The medium is, for example, an optical recording medium such as a DVD (Digital Versatile Disc) or a PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.
For example, when the computer 1000 functions as the information processing apparatus 1, the CPU 1100 of the computer 1000 executes an information processing program loaded on the RAM 1200 to thereby realize the functions of the generation unit 30 and the like. The HDD 1400 stores a program according to the present disclosure (the information processing program 22 in the storage unit 20) and data in the storage unit 20. Note that the CPU 1100 reads the program data 1450 from the HDD 1400 and executes the program data. However, as another example, the CPU 1100 may acquire these programs from another device via the external network 1550.
The disclosed technique is not limited to the embodiment explained above. Several modifications are explained.
In the embodiment explained above, an example in which the sequence x is divided into one target sequence and two context sequences (a context sequence xc1 and a context sequence xc2) in the range designation (
A part of the functions of the information processing apparatus 1 may be realized outside the information processing apparatus 1 (for example, in an external server). In that case, the information processing apparatus 1 may include, in the external server, a part or all of the functions of the storage unit 20 and the generation unit 30. The information processing apparatus 1 communicates with the external server, whereby the processing of the information processing apparatus 1 explained above is realized in the same manner.
The learned model 21 may also include the encoder model 211 as the encoder ENC. In this case, the learned model 21 can be used, for example, to extract a feature value from the sequence x also including the target sequence explained with reference to
The information processing apparatus 1 explained above is specified, for example, as explained below. As explained with reference to
The information processing apparatus 1 may further include display means (the user interface 10) for displaying, in a designation enabled form, a position in a space (the latent space FS) that defines a feature value of the sequence data (for example, the sequence A) learned by the machine learning model (the learned model 21). The control means (the generation unit 30) may generate sequence data having a feature value corresponding to a designated position in the space (the latent space FS) as new sequence data.
The information processing apparatus 1 is also specified as explained below. As explained with reference to
With the information processing apparatus 1 explained above, a sequence including the determined context sequence xc and the new target sequence xT is generated. The context sequence xc configures a part of the sequence and the target sequence xT configures the remainder of the sequence. Therefore, it is possible to generate a sequence in which only a part is generated anew and the remainder is maintained.
As explained with reference to
As explained with reference to
As explained with reference to
As explained with reference to
As explained with reference to
As explained with reference to
The information processing method explained with reference to
The information processing program 22 explained with reference to
The effects described in the present disclosure are only examples and are not limited by the disclosed content. There may be other effects.
Although the embodiment of the present disclosure is explained above, the technical scope of the present disclosure is not limited to the embodiment explained above per se. Various changes are possible without departing from the gist of the present disclosure. Components in different embodiments and modifications may be combined as appropriate.
Note that the present technique can also take the following configurations.
Number | Date | Country | Kind
--- | --- | --- | ---
2020-219553 | Dec 2020 | JP | national

Filing Document | Filing Date | Country | Kind
--- | --- | --- | ---
PCT/JP2021/042384 | 11/18/2021 | WO |