The present invention relates to an acoustic signal processing technique for preventing the voice of an utterer from becoming a nuisance to those around him/her.
PTL 1 describes an acoustic signal processing technique for preventing the voice of an utterer from becoming a nuisance to those around him/her. In the technique described in PTL 1, a disturbing sound (hereinafter referred to as a masking sound) is used to mask the voice of a far-end utterer reproduced from a speaker so that surrounding people cannot hear it, thereby preventing said voice from leaking to the surroundings while also preventing the masking sound itself from becoming excessive and a nuisance to the surrounding people.
[PTL 1] Japanese Patent Application Publication No. 2009-267799
The technique disclosed in PTL 1 reproduces the masking sound so that people in the vicinity cannot hear what is being said, but it gives the utterer no indication of his/her own volume. Therefore, the utterer is unable to ascertain at what volume he/she should speak so that the surrounding people cannot hear the utterance content.
Thus, an object of the present invention is to provide a technique for providing feedback to an utterer on the degree of utterance volume.
One aspect of the present invention includes: an utterance volume evaluation unit that generates an evaluation value for a volume of an utterance voice (referred to as utterance volume evaluation value, hereinafter) from a first sound collection signal that is output by a first microphone installed near an utterer to collect an utterance voice, which is a voice of the utterer, and a second sound collection signal that is output by a second microphone installed at a position farther away from the utterer than the first microphone to collect the utterance voice; and a feedback sound signal generation unit that generates a signal for emitting, from a speaker, a feedback sound indicating the degree of a volume of the utterance voice to the utterer (hereinafter referred to as a feedback sound signal), from the first sound collection signal, by using a feedback gain corresponding to the utterance volume evaluation value.
According to the present invention, it is possible to provide feedback to the utterer on the degree of utterance volume.
Embodiments of the present invention will be described hereinafter in detail. Note that components having the same function are denoted by the same numbers, and redundant description thereof will be omitted.
Prior to the description of each embodiment, the notation in this specification will be explained.
^ (caret) denotes a superscript. For example, x^y^z means that y^z is a superscript to x, and x_y^z means that y^z is a subscript to x. _ (underscore) denotes a subscript. For example, x^y_z means that y_z is a superscript to x, and x_y_z means that y_z is a subscript to x.
Superscripts "^" and "~" as in ^x and ~x for a certain character x would normally be written directly above "x," but are written as ^x or ~x here due to restrictions on notation in this specification.
An utterance feedback device 100 will be described below with reference to
An operation of the utterance feedback device 100 will be described with reference to
In S110, the utterance volume evaluation unit 110 receives a sound collection signal output by the microphone 910 as an input, generates an evaluation value for the volume of an utterance voice (referred to as an utterance volume evaluation value, hereinafter) from the sound collection signal, and outputs the evaluation value. The utterance volume evaluation unit 110 generates the utterance volume evaluation value by, for example, comparing the power of the sound collection signal with a predetermined threshold. The utterance volume evaluation unit 110 may detect a voice section or suppress noise when calculating the power of the sound collection signal. The utterance volume evaluation value may be a value indicating that the utterance volume is large, a value indicating that the utterance volume is small, or the like.
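The threshold comparison described for S110 can be sketched as follows. This is a minimal illustration, not the implementation of the utterance volume evaluation unit 110: the function names, the frame representation (floats in [-1, 1]), the dB scale, and the threshold of -20 dB are all assumptions introduced here.

```python
import math


def frame_power_db(frame):
    """Mean power of one frame of samples (floats in [-1, 1]), in dB re full scale."""
    if not frame:
        raise ValueError("frame must not be empty")
    mean_square = sum(s * s for s in frame) / len(frame)
    # The floor avoids log10(0) on digital silence.
    return 10.0 * math.log10(max(mean_square, 1e-12))


def evaluate_utterance_volume(frame, threshold_db=-20.0):
    """Return 'large' when the frame power exceeds the (assumed) threshold, else 'small'."""
    return "large" if frame_power_db(frame) > threshold_db else "small"


loud_frame = [0.5] * 160    # hypothetical 10 ms frame at 16 kHz
quiet_frame = [0.01] * 160
print(evaluate_utterance_volume(loud_frame), evaluate_utterance_volume(quiet_frame))
```

Voice section detection or noise suppression, which the unit may also apply, would run before `frame_power_db` in such a sketch.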
In S120, the feedback sound signal generation unit 120 receives, as inputs, the sound collection signal output by the microphone 910 and the utterance volume evaluation value generated in S110, generates, from the sound collection signal, a signal of a feedback sound to be emitted from the speaker 920 (hereinafter referred to as a feedback sound signal) by using a feedback gain corresponding to the utterance volume evaluation value, and outputs the feedback sound signal. It is known that an utterer speaks while listening to a feedback sound generated from his/her own utterance voice, that a feedback delay of 20 ms or more becomes a nuisance, and that a feedback delay exceeding 50 ms makes the feedback sound a distraction and makes utterance difficult. Therefore, the feedback sound signal generation unit 120 may generate the feedback sound signal so that, for example, the time from utterance by the utterer until the utterer hears the feedback sound is within 20 ms.
Further, the feedback sound signal generation unit 120 may set the feedback gain to a larger value as the utterance volume evaluation value becomes larger. For example, if the utterance volume evaluation value indicates that the utterance volume is excessive, the feedback sound signal may be generated using a feedback gain that causes temporary distortion. Whether or not the utterance volume evaluation value indicates an excessive utterance volume is preferably determined by checking whether the utterance volume evaluation value exceeds a predetermined threshold.
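To make the 20 ms target concrete, the following sketch estimates the feedback delay of a block-based audio pipeline and the largest frame length that stays within the budget. The latency model (a fixed number of buffered frames plus a lump of extra delay for converters and the acoustic path) and all names and default values are assumptions for illustration; the specification does not prescribe how the delay is met.

```python
def feedback_latency_ms(frame_len, sample_rate_hz, n_buffered_frames=2, extra_ms=0.0):
    """Estimated delay from utterance to feedback sound, in milliseconds.

    Assumed model: n_buffered_frames frames of buffering (e.g. one input
    and one output buffer) plus extra_ms for everything else.
    """
    return 1000.0 * frame_len * n_buffered_frames / sample_rate_hz + extra_ms


def max_frame_len(sample_rate_hz, budget_ms=20.0, n_buffered_frames=2, extra_ms=0.0):
    """Largest frame length (in samples) whose estimated delay fits the budget."""
    usable_ms = budget_ms - extra_ms
    if usable_ms <= 0:
        return 0
    return int(usable_ms * sample_rate_hz / (1000.0 * n_buffered_frames))


# Hypothetical setup: 16 kHz sampling, 2 ms of converter and acoustic delay.
print(feedback_latency_ms(128, 16000, extra_ms=2.0))
print(max_frame_len(16000, budget_ms=20.0, extra_ms=2.0))
```

Under this model, shorter frames reduce the delay at the cost of more frequent processing, so the frame length is one practical knob for staying within 20 ms.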
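One way to picture a feedback gain that grows with the utterance volume evaluation value, and that deliberately distorts the feedback sound when the volume is excessive, is the sketch below. The linear gain curve, the dB breakpoints, the overdrive gain, and the use of hard clipping as the source of distortion are all assumptions made for illustration.

```python
def feedback_gain(volume_db, quiet_db=-40.0, loud_db=-10.0, min_gain=0.5, max_gain=2.0):
    """Gain that is non-decreasing in the volume value (linear between two breakpoints)."""
    if volume_db <= quiet_db:
        return min_gain
    if volume_db >= loud_db:
        return max_gain
    t = (volume_db - quiet_db) / (loud_db - quiet_db)
    return min_gain + t * (max_gain - min_gain)


def make_feedback_frame(frame, volume_db, excessive_db=-10.0, overdrive_gain=8.0):
    """Apply the gain to one frame; an excessive volume switches to an overdrive gain."""
    gain = overdrive_gain if volume_db > excessive_db else feedback_gain(volume_db)
    # Hard clipping to [-1, 1]: with the overdrive gain this produces audible distortion.
    return [max(-1.0, min(1.0, gain * s)) for s in frame]


print(make_feedback_frame([0.1, -0.1], volume_db=-25.0))  # moderate volume, scaled only
print(make_feedback_frame([0.5, -0.5], volume_db=-5.0))   # excessive volume, clipped
```

Because the overdrive gain is applied only while the threshold is exceeded, the distortion is temporary and disappears once the utterer lowers his/her voice.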
Further, the feedback sound signal generation unit 120 may process the sound collection signal by using, for example, noise suppression processing, voice clarification processing, or spectrum processing for emphasizing a voice band, to make the feedback sound a sound that can be easily heard by the utterer. When active noise control (ANC) is used as noise suppression processing, the feedback sound signal generation unit 120 may increase the effect of the active noise control as the utterance volume evaluation value becomes larger.
According to the embodiment of the present invention, it is possible to provide feedback to the utterer on the degree of the utterance volume. Thus, the utterer can spontaneously adjust the utterance volume. Moreover, by using noise suppression processing when generating the feedback sound signal, it is possible to adjust the utterance volume by making use of the Lombard effect (the tendency to speak louder in noise), that is, to keep the utterer from speaking in a loud voice under noise.
An utterance feedback device 200 will be described below with reference to
The operation of the utterance feedback device 200 will be described with reference to
In S210, the utterance volume evaluation unit 210 inputs a first sound collection signal output by the first microphone 910-1 and a second sound collection signal output by the second microphone 910-2, generates an evaluation value for the volume of the utterance voice (hereinafter referred to as an utterance volume evaluation value) from the first sound collection signal and the second sound collection signal, and outputs the evaluation value. The utterance volume evaluation unit 210 generates the utterance volume evaluation value by, for example, comparing the power of the second sound collection signal with a predetermined threshold. The utterance volume evaluation unit 210 utilizes a voice section that is detected by using the first sound collection signal in order to eliminate the influence of noise when obtaining the power of the second sound collection signal. By generating the utterance volume evaluation value on the basis of the power of the second sound collection signal, the utterance volume evaluation unit 210 can generate the utterance volume evaluation value in consideration of the attenuation effect on the utterance voice by the partition when the partition is installed.
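The two-microphone evaluation in S210 can be sketched as follows: voice sections are detected on the first (near) signal, and only the corresponding frames of the second (far) signal contribute to the power that is compared with the threshold. The energy-based voice detector, the thresholds, and the function names are assumptions for illustration and are not taken from the specification.

```python
import math


def mean_power(frame):
    """Mean square of one non-empty frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def evaluate_leak_volume(near_frames, far_frames, vad_threshold=1e-3, leak_threshold_db=-40.0):
    """Evaluate the volume reaching the far microphone during detected voice sections.

    Returns 'large' or 'small', or None when no voice section was detected.
    """
    # Keep only far-mic frames whose time-aligned near-mic frame contains voice.
    voiced_far = [far for near, far in zip(near_frames, far_frames)
                  if mean_power(near) > vad_threshold]
    if not voiced_far:
        return None
    avg = sum(mean_power(f) for f in voiced_far) / len(voiced_far)
    far_db = 10.0 * math.log10(max(avg, 1e-12))
    return "large" if far_db > leak_threshold_db else "small"


# Frame 0: utterer silent, noise at the far mic (ignored). Frame 1: utterer speaking.
near = [[0.0] * 4, [0.5] * 4]
far = [[0.3] * 4, [0.05] * 4]
print(evaluate_leak_volume(near, far))
```

Since the far-microphone power already includes whatever attenuation lies between the utterer and that microphone, the same comparison accounts for a partition without modelling it explicitly.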
In S120, the feedback sound signal generation unit 120 inputs the first sound collection signal output by the first microphone 910-1 and the utterance volume evaluation value generated in S210, and by using a feedback gain corresponding to the utterance volume evaluation value, generates a signal of a feedback sound emitted from the speaker 920 (hereinafter referred to as a feedback sound signal) from the first sound collection signal, and outputs the feedback sound signal.
According to the embodiment of the present invention, it is possible to provide feedback to the utterer on the degree of the utterance volume. The first sound collection signal mainly contains the utterance voice and relatively little surrounding noise, so a voice section can be detected from it reliably. By obtaining the power of the second sound collection signal within the voice section detected in this way, the utterance volume evaluation value can be generated more accurately.
An utterance feedback device 300 will be described below with reference to
The operation of the utterance feedback device 300 will be described with reference to
In S310, the howling prevention unit 310 receives a sound collection signal output by a microphone 910 as an input, generates, from the sound collection signal, a howling evaluation value indicating the possibility of occurrence of howling when a feedback sound is emitted from the speaker, and outputs the howling evaluation value.
In S320, the feedback sound signal generation unit 320 receives, as inputs, the sound collection signal output by the microphone 910, the utterance volume evaluation value generated in S110, and the howling evaluation value generated in S310, generates, from the sound collection signal, a signal of a feedback sound to be emitted from the speaker 920 (hereinafter referred to as a feedback sound signal) by using a feedback gain corresponding to the utterance volume evaluation value and the howling evaluation value, and outputs the feedback sound signal. The feedback sound signal generation unit 320 sets the feedback gain to a smaller value as the howling evaluation value indicates a higher possibility of occurrence of howling.
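The dependence of the feedback gain on both values can be sketched as below. The sketch assumes that a gain has already been derived from the utterance volume evaluation value and that the howling evaluation value is normalized to [0, 1]; the multiplicative reduction and the parameter `max_howling_reduction` are hypothetical choices, since the specification only requires that the gain decrease as howling becomes more likely.

```python
def combined_feedback_gain(volume_gain, howling_value, max_howling_reduction=0.9):
    """Scale the volume-dependent gain down as the howling evaluation value rises.

    howling_value is assumed to lie in [0, 1]; values outside are clamped.
    """
    h = min(1.0, max(0.0, howling_value))
    return volume_gain * (1.0 - max_howling_reduction * h)


for h in (0.0, 0.5, 1.0):
    print(h, combined_feedback_gain(2.0, h))
```

Keeping `max_howling_reduction` below 1 leaves some feedback audible even when howling is judged likely; setting it to 1 would mute the feedback entirely in that case.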
The utterance feedback device may be connected to two microphones.
An utterance feedback device 301 will be described below with reference to
The operation of the utterance feedback device 301 will be described with reference to
In S310, the howling prevention unit 310 receives the first sound collection signal output by the first microphone 910-1 as an input, generates, from the first sound collection signal, a howling evaluation value indicating the possibility of occurrence of howling when a feedback sound is emitted from the speaker, and outputs the howling evaluation value.
In S320, the feedback sound signal generation unit 320 inputs the first sound collection signal output by the first microphone 910-1, the utterance volume evaluation value generated in S110, and the howling evaluation value generated in S310, and by using a feedback gain corresponding to the utterance volume evaluation value and the howling evaluation value, generates a signal of a feedback sound emitted from the speaker 920 (hereinafter referred to as a feedback sound signal) from the first sound collection signal, and outputs the feedback sound signal.
The utterance feedback device may be connected to a microphone array and a speaker array instead of the microphone and the speaker.
An utterance feedback device 302 will be described below with reference to
The operation of the utterance feedback device 302 will be described with reference to
In S305, the microphone array processing unit 305 receives N sound collection signals output by the N microphones included in the microphone array 911 as an input, generates an integrated sound collection signal from the N sound collection signals, and outputs the integrated sound collection signal. It is preferred that, by using predetermined signal processing, for example, the microphone array processing unit 305 form directivity in the direction of the utterer and a dead angle in the direction of the speakers included in the speaker array 921, to generate the integrated sound collection signal.
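As an illustration of forming directivity toward the utterer and a dead angle toward the speakers, the sketch below computes narrowband null-steering weights for the smallest case of N = 2 microphones. The far-field plane-wave model, the single frequency, the microphone spacing, the speed-of-sound constant, and all function names are assumptions; the specification leaves the "predetermined signal processing" open, and a practical system would typically work per frequency bin with more microphones.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_phase(angle_deg, freq_hz, spacing_m):
    """Phase of microphone 1 relative to microphone 0 for a far-field plane wave."""
    delay = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return cmath.exp(-2j * math.pi * freq_hz * delay)


def null_steering_weights(utterer_deg, speaker_deg, freq_hz, spacing_m):
    """Weights (v0, v1) with unit response toward the utterer and zero toward the speaker.

    The array output is v0 * x0 + v1 * x1 for the two microphone signals x0, x1.
    """
    a_u = steering_phase(utterer_deg, freq_hz, spacing_m)
    a_s = steering_phase(speaker_deg, freq_hz, spacing_m)
    denom = a_u - a_s
    if abs(denom) < 1e-9:
        raise ValueError("directions are indistinguishable at this frequency")
    return (-a_s / denom, 1.0 / denom)


def array_response(weights, angle_deg, freq_hz, spacing_m):
    """Complex response of the weighted array to a plane wave from angle_deg."""
    v0, v1 = weights
    return v0 + v1 * steering_phase(angle_deg, freq_hz, spacing_m)


# Hypothetical geometry: utterer at 0 degrees, loudspeaker at 60 degrees.
w = null_steering_weights(0.0, 60.0, freq_hz=1000.0, spacing_m=0.05)
print(abs(array_response(w, 0.0, 1000.0, 0.05)), abs(array_response(w, 60.0, 1000.0, 0.05)))
```

The two constraints (unit gain toward the utterer, zero gain toward the speaker) use up both degrees of freedom of a two-microphone array; additional microphones would leave room to also reduce sensitivity to ambient noise.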
In S325, the speaker array processing unit 325 receives the feedback sound signal generated in S320 as an input, generates, from the feedback sound signal, M individual feedback sound signals, each for emitting sound from a corresponding speaker included in the speaker array 921, and outputs the M individual feedback sound signals. It is preferred that, by using predetermined signal processing, for example, the speaker array processing unit 325 form directivity in the direction of the utterer and a dead angle in the direction of the microphones included in the microphone array 911, to generate the M individual feedback sound signals. The directions of the utterer and the microphones included in the microphone array 911 may be obtained by any method; for example, the direction of the utterer can be obtained by sound source direction estimation performed by the microphone array processing unit 305. When information on the positions of the utterer and the microphones included in the microphone array 911 is available, the directions of the utterer and the microphones may be determined from that information. The information on the positions may be obtained from a system (not shown) that estimates the positions from an image photographed by a camera, or, when the information on the positions is obtained in advance, said information may be used.
The howling evaluation value can be generated more accurately by forming directivity by using the microphone array or the speaker array.
According to the embodiment of the present invention, it is possible to provide feedback to the utterer on the degree of the utterance volume. By preventing howling, the utterer can adjust the utterance volume more accurately and spontaneously.
An utterance feedback device 400 will be described below with reference to
The operation of the utterance feedback device 400 will be described with reference to
In S410, the utterance evaluation unit 410 receives a sound collection signal output by the microphone 910 as an input, generates an evaluation value for the utterance voice (hereinafter referred to as an utterance evaluation value), from the sound collection signal, and outputs the evaluation value.
The utterance evaluation unit 410 will be described below with reference to
The operation of the utterance evaluation unit 410 will be described with reference to
In S110, the utterance volume evaluation unit 110 receives a sound collection signal output by the microphone 910 as an input, generates an evaluation value for the volume of the utterance voice (referred to as an utterance volume evaluation value, hereinafter) from the sound collection signal, and outputs the evaluation value.
In S412, the utterance intelligibility evaluation unit 412 receives a sound collection signal output by the microphone 910 as an input, and generates, from the sound collection signal, an evaluation value for intelligibility of the utterance voice (hereinafter referred to as an utterance intelligibility evaluation value), and outputs the utterance intelligibility evaluation value. For example, a short-time objective intelligibility (STOI) or a voice recognition score can be used as the utterance intelligibility evaluation value.
In S414, the utterance evaluation value calculation unit 414 inputs the utterance volume evaluation value generated in S110 and the utterance intelligibility evaluation value generated in S412, calculates a weighted sum of the utterance volume evaluation value and the utterance intelligibility evaluation value, and outputs the sum as the utterance evaluation value.
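The weighted sum computed in S414 can be written down directly. In the sketch below, both evaluation values are assumed to be normalized to [0, 1] and the weights are hypothetical; the specification does not fix the scale of either value or the weights.

```python
def utterance_evaluation_value(volume_value, intelligibility_value,
                               volume_weight=0.5, intelligibility_weight=0.5):
    """Weighted sum of the volume and intelligibility evaluation values."""
    return volume_weight * volume_value + intelligibility_weight * intelligibility_value


# A quiet but highly intelligible utterance (hypothetical normalized values).
value = utterance_evaluation_value(0.2, 0.9)
print(value, value > 0.5)
```

With these assumed numbers, the utterance exceeds a threshold of 0.5 even though its volume alone would not, which is the case the intelligibility term is meant to capture.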
In S420, the feedback sound signal generation unit 420 inputs the sound collection signal output by the microphone 910 and the utterance evaluation value generated in S410, and by using a feedback gain corresponding to the utterance evaluation value, generates a signal of a feedback sound emitted from the speaker 920 (hereinafter referred to as a feedback sound signal) from the sound collection signal, and outputs the feedback sound signal.
The utterance feedback device may be configured to provide feedback using visual information instead of using sound. In this case, the utterance feedback device 400 includes a feedback information generation unit 421 (not shown) instead of the feedback sound signal generation unit 420. The feedback information generation unit 421 inputs the utterance evaluation value generated in S410, and generates and outputs information indicating that the volume of utterance is large, when the utterance evaluation value is larger than a predetermined threshold.
According to the embodiment of the present invention, it is possible to provide feedback to the utterer on the degree of annoyance of the utterance based on the volume and intelligibility of the utterance. By using the utterance evaluation value, which also takes the intelligibility of the utterance into account, it is possible to provide feedback even on utterances that disturb the surrounding people although their volume is low, for example, because the content of the utterance can still be heard.
The device of the present invention includes, for example, as single hardware entities, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communication with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, a register, and the like), a RAM or a ROM that is a memory, an external storage device that is a hard disk, and a bus for connecting the input unit, the output unit, the communication unit, the CPU, the RAM, the ROM, and the external storage device so as to allow data communication therebetween. Also, if necessary, the hardware entity may be provided with a device (drive) or the like capable of reading and writing data from/to a recording medium such as a CD-ROM. Examples of a physical entity including such hardware resources include a general-purpose computer.
A program required to implement the above-described functions, data required for the processing of the program, and the like are stored in the external storage device of the hardware entity (the present invention is not limited to the external storage device and, for example, the program may be stored in the ROM, which is a read-only storage device). Further, data or the like obtained by the processing of the program is appropriately stored in the RAM, the external storage device, or the like.
In the hardware entity, each program stored in the external storage device (or ROM, etc.) and data necessary for the processing of each program are read into a memory as necessary, and interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes predetermined functions (the constitutional units described above as units, means, etc.).
The present invention is not limited to the above-described embodiments, and appropriate changes can be made without departing from the spirit of the present invention. Further, the processes described in the embodiments are not only executed in time series in the described order, but also may be executed in parallel or individually according to a processing capability of a device that executes the processes or as necessary.
As described above, when a processing function in the hardware entity (the device according to the present invention) described in the above-described embodiments is implemented by a computer, processing content of the function to be included in the hardware entity is described by the program. By executing this program on the computer, the processing function in the above-described hardware entity is implemented on the computer.
A program describing this processing content can be recorded on a computer-readable recording medium. Examples of the computer-readable recording medium may include any recording medium such as a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like can be used as the optical disc; an MO (Magneto-Optical disc) or the like can be used as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like can be used as the semiconductor memory.
The program is distributed, for example, by sales, transfer, or lending of a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. In addition, the distribution of the program may be performed by storing the program in advance in a storage device of a server computer and transferring the program from the server computer to another computer via a network.
The computer that executes such a program first temporarily stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer, in a storage device of the computer. When the computer executes the processing, the computer reads the program stored in the storage device of the computer and executes processing according to the read program. Further, as another embodiment of the program, the computer may directly read the program from the portable recording medium and execute processing according to the program, and further, processing according to a received program may be sequentially executed each time the program is transferred from the server computer to the computer. Furthermore, instead of transferring the program to the computer from a server computer, the processing described above may be executed by a so-called ASP (Application Service Provider) type service, in which a processing function is realized by execution commands and result acquisition alone. Note that the program in the present embodiment includes information to be used for processing by an electronic computer and equivalent to the program (data which is not a direct command to the computer but has a property that regulates the processing of the computer and the like).
Further, although the hardware entity is configured by executing a predetermined program on the computer in the present embodiment, at least a part of the processing content of the hardware entity may be realized in hardware.
The foregoing description of the embodiments of the present invention has been presented for the purpose of illustration and description. It is not intended to be exhaustive or to limit the present invention to the precise form disclosed. Modifications or variations are possible in light of the above teachings. The embodiments were chosen and described in order to provide the best illustration of the principle of the present invention and to enable those skilled in the art to use the present invention in various embodiments and with various modifications suited to the practical use contemplated. All such modifications and variations are within the scope of the present invention as defined by the appended claims when interpreted in accordance with the breadth to which they are fairly, legally, and equitably entitled.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/JP2021/029278 | 8/6/2021 | WO | |