The present invention relates to a generation device, a generation method, and a generation program.
The digital twin (DT) technique, which maps a real space target onto cyber space, has been realized by the progress of ICT (Information and Communication Technology) and is attracting attention. The DT maps and accurately expresses a shape, a state, a function, and the like of a real world target such as a production machine in a factory, an engine of an aircraft, or an automobile.
By using this DT, current state analysis, future prediction, simulation of possibilities, and the like regarding the target can be performed in the cyber space. Further, benefits of the cyber space, such as intelligently controlling the real world target on the basis of the results and facilitating utilization of ICT technology, can be fed back to the real world target.
In the future, as more and more real world targets are converted into DTs, demand is expected to increase for cooperation between industries and for large-scale simulations in which different and diverse DTs across industries interact with or are combined with one another.
When a user himself or herself uses a DT, or a DT is used by the DT of the user (a human DT), the user or the human DT needs to recognize the interaction between DTs as in the real world.
A human recognizes a real world event by using the five senses. For this reason, in order to recognize the interaction between DTs, it is necessary to digitize the visual, auditory, olfactory, and taste information of the interaction. Here, attention is paid to the digitization of sound information, which is related to the auditory information.
In an example shown in
In order to generate the sound described in
Subsequently, the interaction which can be performed by the virtual object is defined (step S11). In an example shown in
While the above-mentioned prior art is a technique for manually assigning the sound, there is also a prior art that analyzes video to synthesize a sound effect.
As shown in
In the prior art, sound textures for the selected material and interaction are generated (step S21). The sound textures include sound texture so1 of each material and sound texture so2 of each interaction.
In the prior art, a two-dimensional video 17 is analyzed by using an NN (Neural Network) 16, and an appropriate sound 18 obtained by combining sound textures so1 and so2 is synthesized (step S22).
In the prior art described in
Unlike two-dimensional sound synthesis, in sound synthesis in the three-dimensional space, the sound is unnatural to the user unless the position of the sound source is set appropriately.
The present invention has been made in view of the above, and an object of the present invention is to provide a generation device, a generation method, and a generation program capable of synthesizing a realistic sound in the three-dimensional space.
In order to solve the above-described problem and achieve the object, a generation device is characterized by including a coefficient acquisition unit that acquires coefficient information of an object on the basis of raw material information of the object when detecting an interaction of the object mapped on cyber space, a sound source selection unit that calculates statistical information in which a type of a sound source corresponding to the interaction of the object and an intensity of the sound source are associated with each other and selects sound source information corresponding to the statistical information by inputting position information, shape information, and the coefficient information of the object to a machine learning model, and a voice synthesis unit that generates synthesized sound source information by synthesizing the sound source information on the basis of the statistical information and the sound source information and generates three-dimensional sound source information by executing three-dimensional sound rendering on the synthesized sound source information on the basis of the position information.
According to the present invention, the realistic sound can be synthesized even in the three-dimensional space.
Hereinafter, examples of the generation device, the generation method, and the generation program disclosed in the present application will be described in detail with reference to the drawings. Note that these examples are not intended to limit the scope of the present invention.
In the DT (Digital Twin) technique, a real space object is mapped onto cyber space (digital three-dimensional space). Data of the object mapped onto the cyber space is expressed as DT data. A user can view the DT data of the object mapped onto the cyber space by using VR (Virtual Reality) or AR (Augmented Reality).
The position is position coordinates (x, y, z) of the object which uniquely specifies the position of the object. The posture is posture information (yaw, roll, pitch) of the object which uniquely specifies an orientation of the object. The shape is mesh information or geometry information representing the shape of a three-dimensional object to be displayed. The appearance is color information of an object surface. The material is information indicating the material of the object. The mass is information indicating the mass of the object.
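As one illustration, the DT data of a single object might be represented as in the following minimal sketch, assuming a simple Python representation; the class and field names are hypothetical and merely mirror the attributes listed above.

```python
# A minimal, illustrative representation of one DT data record (hypothetical names).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DTData:
    position: Tuple[float, float, float]      # (x, y, z) coordinates of the object
    posture: Tuple[float, float, float]       # (yaw, roll, pitch) orientation of the object
    shape: List[Tuple[float, float, float]]   # mesh vertices (geometry information)
    appearance: Tuple[int, int, int]          # surface color (RGB)
    material: str                             # material name, e.g. "Metal" or "Wood"
    mass: float                               # mass of the object
```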
A generation device according to the present example acquires coefficient information (friction coefficient, attenuation coefficient) of the object on the basis of raw material information (information including the shape, the mass, and the material) of the object when detecting an interaction of the object in the cyber space.
The generation device calculates statistical information in which a type of a sound source corresponding to the interaction of the object and intensity of the sound source are associated with each other, and selects sound source information corresponding to the statistical information from a sound source DB by inputting the position information, the shape information, and the coefficient information of the object to a trained machine learning model.
The generation device generates synthesized sound source information by synthesizing the sound source information on the basis of the statistical information and the sound source information. The generation device then generates three-dimensional sound source information by executing three-dimensional sound rendering on the synthesized sound source information on the basis of the position information of the object, and outputs the three-dimensional sound source information to a device such as a VR device or an AR device.
The generation device executes the above-mentioned processing to generate and output the realistic three-dimensional sound source information. The three-dimensional sound source information is a natural sound source for a user experiencing the cyber space.
Next, a configuration example of the generation device according to the present example will be described.
The device 50 outputs the DT data (time-series DT data) corresponding to each object to the generation device 100. In addition, the device 50 acquires the three-dimensional sound source information generated by the generation device 100, and generates the three-dimensional sound source in the cyber space.
In the example shown in
The generation device 100 includes an interaction detection unit 110, a physical information acquisition unit 110a, an object extraction unit 110b, a coefficient acquisition unit 120, a sound source selection unit 130, and a voice synthesis unit 140.
The interaction detection unit 110 acquires the DT data of each object from the device 50, and detects the interaction of the object. It is assumed that the interaction is defined in advance. For example, the interaction detection unit 110 detects a collision (interaction) when a distance between a plurality of objects becomes less than a threshold value.
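The collision rule described above can be sketched as follows, assuming that each object exposes its position as an (x, y, z) tuple; the threshold value is an illustrative placeholder, not a value taken from the present example.

```python
# A sketch of the predefined collision (interaction) rule; the threshold is illustrative.
import math

COLLISION_THRESHOLD = 0.05  # assumed threshold distance in the cyber space

def detect_collision(position_a, position_b, threshold=COLLISION_THRESHOLD):
    """Return True when the distance between the two objects is less than the threshold."""
    return math.dist(position_a, position_b) < threshold
```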
The interaction detection unit 110 outputs the target object information to the physical information acquisition unit 110a, and outputs the scenario information to the object extraction unit 110b, when detecting the interaction. The target object information includes the DT data of the object related to the interaction over a fixed time interval before and after the detection of the interaction. The scenario information includes a type of the interaction, the DT data of the object related to the interaction, and the like.
The physical information acquisition unit 110a calculates the shape information, the position information, and the movement information of the object at the time of the interaction detection on the basis of the target object information. For example, the shape information includes information on a collision area. The position information includes information on the three-dimensional position of the object in the cyber space at the point in time when the interaction is detected. The movement information includes information on a movement speed and a movement direction immediately before the interaction is detected.
The physical information acquisition unit 110a outputs the shape information, the position information, and the movement information to the sound source selection unit 130. The physical information acquisition unit 110a outputs the position information and the movement information to the voice synthesis unit 140.
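As a sketch of how the movement information might be derived, the movement speed and movement direction can be estimated from two consecutive DT data frames; the fixed frame interval assumed below is illustrative.

```python
# A sketch of estimating the movement information from consecutive DT data frames.
import numpy as np

def movement_information(position_prev, position_now, frame_interval=1.0 / 60.0):
    """Estimate the movement speed and movement direction immediately before the interaction."""
    displacement = np.asarray(position_now, dtype=float) - np.asarray(position_prev, dtype=float)
    speed = np.linalg.norm(displacement) / frame_interval              # movement speed
    direction = displacement / (np.linalg.norm(displacement) + 1e-9)   # unit movement direction
    return speed, direction
```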
The object extraction unit 110b extracts raw material information of the object related to the interaction on the basis of the scenario information. The raw material information includes information on the shape, the mass, and the material of the object. The object extraction unit 110b outputs the extracted raw material information to the coefficient acquisition unit 120 and the sound source selection unit 130.
The coefficient acquisition unit 120 has a material DB (DataBase) 120a and an acquisition unit 120b.
The material DB 120a stores information on the friction coefficient and the attenuation coefficient of a substance corresponding to a combination of the shape, the mass, and the material of the object.
The acquisition unit 120b acquires the friction coefficient and the attenuation coefficient corresponding to the raw material information on the basis of the raw material information (shape, mass, material) and the material DB 120a. The acquisition unit 120b sets the acquired friction coefficient and attenuation coefficient to coefficient information, and outputs the coefficient information to the sound source selection unit 130.
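A minimal sketch of this lookup is shown below; the material DB in the present example is keyed by the combination of the shape, the mass, and the material, whereas this sketch is keyed by the material name alone, and the coefficient values are illustrative placeholders.

```python
# A simplified material DB lookup; keys and coefficient values are illustrative.
MATERIAL_DB = {
    "Metal": {"friction_coefficient": 0.6, "attenuation_coefficient": 0.02},
    "Wood":  {"friction_coefficient": 0.4, "attenuation_coefficient": 0.10},
    "Glass": {"friction_coefficient": 0.5, "attenuation_coefficient": 0.05},
}

def acquire_coefficients(material):
    """Return the coefficient information (friction coefficient, attenuation coefficient)."""
    entry = MATERIAL_DB[material]
    return entry["friction_coefficient"], entry["attenuation_coefficient"]
```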
The sound source selection unit 130 has a sound source DB 130a and a selection unit 130b.
The sound source DB 130a stores a plurality of sound elements. The sound elements include sinusoidal information of each different frequency and sound texture information. The sound textures include information related to short recordings and cochleagrams.
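For example, a sinusoidal sound element could be generated as in the following sketch, assuming waveforms are held as numpy arrays at a fixed sampling rate; the parameters are illustrative.

```python
# A sketch of a sinusoidal sound element as it might be stored in the sound source DB.
import numpy as np

def sinusoidal_element(frequency_hz, duration_s=0.5, sample_rate=44100):
    """Generate a sinusoidal waveform of the given frequency."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return np.sin(2.0 * np.pi * frequency_hz * t)
```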
For example, the sound source DB 130a stores the sound sources of the sound elements of a plurality of materials and the sound sources of the sound elements of a plurality of interactions.
There are 23 types of materials, for example, Brick, Carpet, Ceramic, Fabric, Foliage, Food, Glass, Hair, Leather, Metal, Mirror, Other, Painted, Paper, Plastic, Polished stone, Stone, Skin, Sky, Tile, Wallpaper, Water, and Wood.
The selection unit 130b acquires the position information and the shape information of the object from the physical information acquisition unit 110a. The selection unit 130b acquires the coefficient information from the coefficient acquisition unit 120.
The selection unit 130b inputs the position information, the shape information, and the coefficient information of the object to the trained machine learning model to calculate the statistical information corresponding to the interaction of the object. The selection unit 130b selects the sound source information corresponding to the statistical information from the sound source DB 130a. The selection unit 130b outputs the statistical information and the sound source information to the voice synthesis unit 140.
The selection unit 130b performs dimension reduction by applying PCA (Principal Component Analysis) to the object related information 30. The selection unit 130b may perform the dimension reduction by using any conventional technique, and may select any information among (1) to (6) to be subjected to the dimension reduction.
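The dimension reduction can be sketched with scikit-learn's PCA as follows, assuming the object related information has already been flattened into feature vectors; the number of components is an illustrative choice.

```python
# A sketch of the PCA-based dimension reduction of the object related information.
import numpy as np
from sklearn.decomposition import PCA

def reduce_dimensions(object_related_information, n_components=8):
    """Apply PCA to a (n_samples, n_features) array of object related information."""
    features = np.asarray(object_related_information, dtype=float)
    pca = PCA(n_components=n_components)
    return pca.fit_transform(features)  # reduced features: (n_samples, n_components)
```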
The selection unit 130b calculates the statistical information 36 by inputting the object related information 30 subjected to the dimension reduction to the machine learning model 35. The machine learning model 35 is, for example, a Recurrent Neural Network, a Convolutional Neural Network, or the like.
For example, in the statistical information, the identification information of each sound source is associated with the intensity of the sound source. The statistical information 36 includes the sound elements related to the materials, se_m1 at 20%, se_m2 at 70%, and se_m3 at 10%, and the sound elements related to the interactions, se_i1 at 80% and se_i2 at 20%. In the statistical information 36, when the intensities of the sound elements related to the materials are summed up, the sum becomes 100%. Similarly, when the intensities of the sound elements related to the interactions are summed up, the sum becomes 100%.
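Using the example above, the statistical information 36 might be held as in the following sketch, where each identifier is mapped to its intensity and each group of intensities sums to 100%; the dictionary layout is an assumption for illustration.

```python
# An illustrative representation of the statistical information 36.
statistical_information = {
    "materials":    {"se_m1": 0.20, "se_m2": 0.70, "se_m3": 0.10},  # sums to 100%
    "interactions": {"se_i1": 0.80, "se_i2": 0.20},                 # sums to 100%
}

assert abs(sum(statistical_information["materials"].values()) - 1.0) < 1e-9
assert abs(sum(statistical_information["interactions"].values()) - 1.0) < 1e-9
```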
The selection unit 130b selects the sound source information corresponding to the statistical information 36 from the sound source DB 130a. The selection unit 130b outputs the statistical information (for example, the statistical information 36) and the sound source information to the voice synthesis unit 140.
Here, it is assumed that the machine learning model 35 used by the selection unit 130b is trained in advance on the basis of teacher data composed of a set of input data and a correct answer label. The input data corresponds to the object related information 30. The correct answer label is information in which the identification information of the sound source and the intensity of the sound source are associated with each other.
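One possible form of such a model is sketched below in PyTorch, with two softmax heads whose outputs sum to 1 (that is, 100%) over the material sound elements and the interaction sound elements, respectively; the layer sizes are illustrative, and as noted above an RNN or CNN could be used instead.

```python
# An illustrative model producing the statistical information (two softmax heads).
import torch
import torch.nn as nn

class StatisticalInfoModel(nn.Module):
    def __init__(self, n_features=8, n_material_elements=3, n_interaction_elements=2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.material_head = nn.Linear(64, n_material_elements)
        self.interaction_head = nn.Linear(64, n_interaction_elements)

    def forward(self, x):
        h = self.backbone(x)
        # Intensities of the sound elements; each head sums to 1 (i.e. 100%).
        return (torch.softmax(self.material_head(h), dim=-1),
                torch.softmax(self.interaction_head(h), dim=-1))
```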
Let us return to the description of
The voice synthesis unit 140 has a synthesis unit 140a and a rendering unit 140b.
The synthesis unit 140a generates synthesized sound source information by synthesizing the sound source information on the basis of the statistical information and the sound source information. For example, the synthesis unit 140a connects the sound elements of the sound source information by using the technique of NPL 2 to generate the synthesized sound source information.
The synthesis unit 140a outputs the generated synthesized sound source information to the rendering unit 140b.
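As a simplified stand-in for the synthesis method of NPL 2, the selected sound elements could be mixed with their intensities as weights, as in the following sketch; the actual connection of sound elements follows NPL 2 and is not reproduced here.

```python
# A simplified intensity-weighted mix of equal-length sound element waveforms.
import numpy as np

def synthesize(sound_elements, intensities):
    """Mix sound elements (equal-length numpy waveforms) weighted by their intensities."""
    mixed = np.zeros_like(sound_elements[0], dtype=float)
    for element, weight in zip(sound_elements, intensities):
        mixed += weight * element
    return mixed
```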
The rendering unit 140b generates the three-dimensional sound source information (3D waveform), in which the synthesized sound source information is converted into three dimensions, by executing three-dimensional sound rendering on the synthesized sound source information on the basis of the position information and the movement information. For example, the rendering unit 140b executes the three-dimensional sound rendering by using the technique of NPL 3. The rendering unit 140b outputs the three-dimensional sound source information to a transmission unit 150.
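The following sketch is a simplified stand-in for the three-dimensional sound rendering of NPL 3, combining inverse-distance attenuation with constant-power amplitude panning based on the source position relative to the listener; the listener position and panning rule are assumptions for illustration.

```python
# A simplified positional rendering sketch (not the method of NPL 3 itself).
import numpy as np

def render_3d(waveform, source_position, listener_position=(0.0, 0.0, 0.0)):
    """Return a (2, n_samples) stereo waveform approximating a 3D-positioned source."""
    offset = np.asarray(source_position, dtype=float) - np.asarray(listener_position, dtype=float)
    distance = np.linalg.norm(offset) + 1e-6
    gain = 1.0 / distance                      # inverse-distance attenuation
    pan = 0.5 * (1.0 + offset[0] / distance)   # 0 = full left, 1 = full right
    left = np.sqrt(1.0 - pan) * gain * waveform
    right = np.sqrt(pan) * gain * waveform
    return np.vstack([left, right])
```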
The transmission unit 150 transmits the three-dimensional sound source information to the device 50.
Next, processing procedure of the generation device 100 according to the present example will be described.
On the other hand, when detecting the interaction (step S102, Yes), the interaction detection unit 110 proceeds to step S103. The physical information acquisition unit 110a of the generation device 100 calculates the shape information, the position information, and the movement information related to the interacted object on the basis of the target object information acquired from the interaction detection unit 110 (step S103).
The object extraction unit 110b of the generation device 100 extracts the raw material information related to the interacted object on the basis of the scenario information acquired from the interaction detection unit 110 (step S104).
The coefficient acquisition unit 120 of the generation device 100 executes coefficient acquisition processing (step S105). The sound source selection unit 130 of the generation device 100 executes sound source selection processing (step S106). The voice synthesis unit 140 of the generation device 100 executes voice synthesis processing (step S107).
The transmission unit 150 of the generation device 100 transmits the three-dimensional sound source information to the device 50 (step S108).
Next, an example of the processing procedure of the coefficient acquisition processing shown in step S105 of
The acquisition unit 120b retrieves the friction coefficient and the attenuation coefficient from the material DB 120a on the basis of the raw material information (step S202). The acquisition unit 120b outputs the friction coefficient and the attenuation coefficient to the sound source selection unit 130 (step S203).
Next, an example of the processing procedure of the sound source selection processing shown in the step S106 of
The selection unit 130b executes the dimension reduction to the object related information (step S302). The selection unit 130b inputs the object related information subjected to the dimension reduction to the machine learning model, and calculates the statistical information (step S303).
The selection unit 130b acquires the sound source information corresponding to the statistical information from the sound source DB 130a (step S304). The selection unit 130b outputs the statistical information and the sound source information to the voice synthesis unit 140 (step S305).
Next, an example of the processing procedure of the voice synthesis processing shown in the step S107 of
The synthesis unit 140a generates the synthesized sound source information by connecting the sound elements of the sound source information on the basis of the statistical information (step S402). The rendering unit 140b of the voice synthesis unit 140 executes the three-dimensional sound rendering on the basis of the synthesized sound source information, the position information, and the movement information, and generates the three-dimensional sound source information (step S403).
The rendering unit 140b outputs the three-dimensional sound source information (step S404).
Next, the effect of the generation device 100 according to the present example will be described. The generation device 100 acquires the coefficient information (friction coefficient, attenuation coefficient) of the object on the basis of the raw material information (information including the shape, the mass, and the material) of the object when detecting the interaction of the object on the cyber space. The generation device 100 calculates the statistical information in which the type of the sound source corresponding to the interaction of the object and the intensity of the sound source are associated with each other, and selects the sound source information corresponding to the statistical information from the sound source DB 130a by inputting the position information, the shape information, and the coefficient information of the object to the trained machine learning model. The generation device 100 generates the synthesized sound source information by synthesizing the sound source information on the basis of the statistical information and the sound source information, generates the three-dimensional sound source information by executing the three-dimensional sound rendering on the synthesized sound source information on the basis of the position information of the object, and outputs the three-dimensional sound source information to the device 50.
The generation device 100 can generate and output the realistic three-dimensional sound source information by executing the above-mentioned processing. The three-dimensional sound source information is a natural sound source for the user experiencing the cyber space by using the device 50.
The generation device 100 executes the dimension reduction on the position information, the shape information, and the coefficient information, and calculates the statistical information by inputting the information resulting from the dimension reduction into the machine learning model. Accordingly, the cost of using the machine learning model can be reduced.
The generation device 100 acquires, as the position information, the speeds and the accelerations of the collision parts of the two objects and the positions of the two objects when detecting the collision of the two objects as the interaction. By using such position information, the three-dimensional sound source information in the cyber space can be generated with high accuracy.
Subsequently, an example of a computer that performs a generation program will be described below.
The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disc drive interface 1040 is connected to a disc drive 1041. A removable storage medium such as a magnetic disc or an optical disc is inserted into the disc drive 1041, for example. A mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example. A display 1061 is connected to the video adapter 1060, for example.
Here, an OS 1091, an application program 1092, a program module 1093, and program data 1094 are stored in the hard disk drive 1031, for example. Each piece of information described in the above-mentioned embodiment is stored in the hard disk drive 1031 or the memory 1010, for example.
In addition, the generation program is stored in the hard disk drive 1031, for example, as the program module 1093 in which instructions executed by the computer 1000 are described. Specifically, the program module 1093 describing the respective processes executed by the interaction detection unit 110, the physical information acquisition unit 110a, the object extraction unit 110b, the coefficient acquisition unit 120, the sound source selection unit 130, the voice synthesis unit 140, and the transmission unit 150 is stored in the hard disk drive 1031.
Further, data used for information processing by the generation program is stored in, for example, the hard disk drive 1031 as the program data 1094. Thereafter, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 when necessary, and executes each procedure described above.
Note that the storage of the program module 1093 and the program data 1094 related to the generation program is not limited to the case where the program module 1093 and the program data 1094 are stored in the hard disk drive 1031, and the program module 1093 and the program data 1094 may be stored in, for example, a detachable storage medium and may be read by the CPU 1020 via the disc drive 1041. Alternatively, the program module 1093 and the program data 1094 related to the generation program may be stored in another computer connected via a network such as a LAN or a WAN (Wide Area Network) and may be read by the CPU 1020 via the network interface 1070.
Although the embodiments to which the invention made by the present inventor is applied have been described above, the present invention is not limited by the descriptions and drawings that form a part of the disclosure of the present invention based on the embodiments. That is to say, other embodiments, examples, operation techniques, and the like made by those skilled in the art on the basis of the embodiments are all included in the scope of the present invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2021/047419 | 12/21/2021 | WO |