The present invention relates to a voice recognition technology, and particularly relates to a switching technology between an enhancement signal and an observation signal.
In recent years, the performance of voice recognition has been improved by the development of deep learning technology. However, an example of a situation in which voice recognition is still difficult is mixed voice (overlapping speech) of a plurality of persons. In order to cope with this, the following techniques have been devised.
Blind sound source separation enables voice recognition by separating a mixed voice, which is difficult to recognize as it is, into the respective voices of the speakers (see, for example, Non Patent Literature 1).
Target speaker extraction uses a speech preregistered by a target speaker as auxiliary information, and acquires only the voice of the preregistered speaker from a mixed sound (see, for example, Non Patent Literature 2). The extracted voice includes only the voice of the target speaker, and thus voice recognition is possible. However, when undesired sounds are removed, the voice of the target speaker may be distorted. That is, there are cases where voice recognition performance is deteriorated by performing voice enhancement.
A method of weakening the intensity of voice enhancement for a section in which overlap speech does not occur has been proposed (see, for example, Non Patent Literature 3). Although voice enhancement is effective for overlap speech, there is a high possibility that voice recognition deteriorates when voice enhancement is performed on non-overlap speech (the independent speech of the target speaker).
However, the effect of voice enhancement is not determined only by the presence or absence of overlap speech. For example, even in an overlap speech section, if there is a large difference in volume between the target speaker and the interference speaker (another speaker), voice recognition tends to recognize only the louder voice of the target speaker. In this case, a high voice recognition rate can be obtained by performing voice recognition on the observation signal as it is, without performing voice enhancement. Conversely, even in a non-overlap speech section, it is conceivable that an input to which voice enhancement is applied yields a higher voice recognition rate. In view of the above problems, an object of the present invention is to provide a technology capable of improving voice recognition performance.
In order to solve the problem described above, a voice signal processing method according to an aspect of the present invention includes: acquiring an output value indicating whether to perform voice enhancement on an observation signal in which a voice or noise of another speaker overlaps with a voice of a target speaker, or indicating a degree of necessity of performing the voice enhancement; and deciding, under a predetermined condition, a ratio between the observation signal and an enhancement signal generated by the voice enhancement using the acquired output value to determine an input signal to be used for the voice recognition.
According to the present invention, voice recognition performance can be improved.
First, a notation method in this specification will be described.
The symbol "˜" (superscript tilde) used in the text would normally be written immediately above the character that follows it, but is written immediately before the character due to limitations of text notation. In mathematical formulas, these symbols are placed in their rightful positions, that is, directly above the characters. For example, "˜S" is expressed by the following expression in a mathematical expression.
In addition, a symbol “{circumflex over ( )}” (superscripted hat) used in this specification is also described immediately before the character. In a mathematical formula, these symbols are placed in the rightful positions, that is, directly above the characters. For example, “{circumflex over ( )}k” is expressed by the following expression in the mathematical expression.
Hereinafter, an embodiment of the present invention will be described in detail. Constituents that have the same functions are denoted by the same reference numerals, and redundant description will be omitted.
Hereinafter, the voice signal processing method performed by the voice signal processing device 1 according to the embodiment will be described with reference to the accompanying drawings.
In step S11, the voice enhancement unit 11 performs voice enhancement processing. That is, the voice enhancement unit 11 acquires an observation signal as an input and extracts only the desired voice from the acquired observation signal using a known voice enhancement technology. As a method for extracting the desired voice, for example, a known target speaker extraction technology can be used, as illustrated in the accompanying drawing.
In step S12, the switching model unit 12 receives the enhancement signal from the voice enhancement unit 11. The switching model unit 12 also receives the observation signal, that is, the voice signal that has not been subjected to the voice enhancement processing of the voice enhancement unit 11. The observation signal is configured to be directly input to the switching model unit 12, as illustrated in the accompanying drawing.
The switching model unit 12 is a learned model trained using a technology such as a known deep neural network. The signal received as an input by the switching model unit 12 may be a waveform-domain signal, or a signal that has been subjected to feature extraction. At least one of the observation signal and the enhancement signal is input to the switching model unit 12, and the switching model unit 12 outputs whether to perform the voice enhancement from the viewpoint of voice recognition performance, or the degree of necessity of performing the voice enhancement. The output {circumflex over ( )}k of the switching model unit 12 is a value (estimated value) calculated by the switching model unit 12, and can be, for example, a scalar value in the range of 0 to 1 defined by the following expression.
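The expression itself is not reproduced in this text. One plausible formulation consistent with the surrounding description, assuming the network reduces its input to a scalar logit z before the output layer (an assumption, not the literal expression of the disclosure), is:

```latex
\hat{k} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{k} \in (0, 1)
```

Any squashing function that maps the model output into the interval from 0 to 1 would serve the same role.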
The switching model unit 12 may be configured to calculate the output {circumflex over ( )}k as a time-series vector. When {circumflex over ( )}k is calculated as a time-series vector, a different weight can be adopted for each time, and the input of the voice recognition can be determined more finely.
The switching model unit 12 outputs {circumflex over ( )}k, which is the calculated result, to the voice recognition input determination unit 13. A learning method of the switching model unit 12 will be described later.
In step S13, the voice recognition input determination unit 13 receives the output value {circumflex over ( )}k from the switching model unit 12 and the enhancement signal {circumflex over ( )}S from the voice enhancement unit 11, and determines the input of the voice recognition.
Here, when the input to the voice recognition unit 14 is denoted as ˜S, the input ˜S is determined to be either the enhancement signal {circumflex over ( )}S or the observation signal Y, as defined by the following expression. In Expression (2), λ is a preset value in the range 0<λ<1, such as 0.5. In the present embodiment, this method of determining one of the enhancement signal {circumflex over ( )}S and the observation signal Y as the input ˜S to the voice recognition unit 14 will be referred to as a "hard method".
˜S, which is the input of the voice recognition, may be determined by weighting and adding the enhancement signal {circumflex over ( )}S and the observation signal Y using the output value {circumflex over ( )}k of the switching model unit 12 as defined by the following expression. In the present embodiment, a method of determining ˜S which is an input to the voice recognition unit 14 by weighting and adding the enhancement signal {circumflex over ( )}S and the observation signal Y using the output value {circumflex over ( )}k will be referred to as a “soft method”.
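As an illustrative sketch of the hard and soft methods described above (Expressions (1) to (3) are not reproduced here; the threshold λ and the weighting follow the text, and the function names are hypothetical):

```python
import numpy as np

def hard_select(k_hat, s_enh, y_obs, lam=0.5):
    """Hard method: select the enhancement signal when k_hat reaches
    the preset threshold lam (0 < lam < 1), else the observation."""
    return s_enh if k_hat >= lam else y_obs

def soft_select(k_hat, s_enh, y_obs):
    """Soft method: weighted sum of the enhancement and observation
    signals using the switching model output k_hat."""
    return k_hat * s_enh + (1.0 - k_hat) * y_obs

# Toy signals for illustration only.
s_enh = np.array([1.0, 1.0])
y_obs = np.array([0.0, 0.0])
assert np.allclose(hard_select(0.8, s_enh, y_obs), s_enh)
assert np.allclose(soft_select(0.25, s_enh, y_obs), [0.25, 0.25])
```

The hard method commits to one signal per utterance, while the soft method retains a graded mixture that reflects the model's uncertainty.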
The voice recognition input determination unit 13 outputs ˜S determined by the hard method or the soft method to the voice recognition unit 14.
In step S14, the voice recognition unit 14 performs voice recognition processing on the signal ˜S received from the voice recognition input determination unit 13. The voice recognition unit 14 may instead receive the enhancement signal {circumflex over ( )}S obtained by the voice enhancement unit 11 and the observation signal Y including the speech, noise, or the like of another speaker, and perform voice recognition processing on each of them. The voice recognition unit 14 outputs text information that is the voice recognition result corresponding to each voice signal. A known voice recognition technology can be used for the voice recognition unit 14.
A specific processing flow of the voice recognition input determination processing (step S13) will now be described.
In step S131, the output acquisition unit 131 receives the output value {circumflex over ( )}k from the switching model unit 12 and transmits it to the decision unit 132. In step S132, the decision unit 132 performs a predetermined decision using the received output value {circumflex over ( )}k and outputs the decision result to the determination unit 133. In the predetermined decision, for example, when the hard method is adopted, the magnitude of {circumflex over ( )}k is determined, and only one of the signals {circumflex over ( )}S and Y is output to the determination unit 133 on the basis of the decision using Expressions (1) and (2) described above. When the soft method is adopted, the signals {circumflex over ( )}S and Y are output to the determination unit 133 together with the value of {circumflex over ( )}k. As another example, information indicating which of the soft method and the hard method is to be adopted, the value of {circumflex over ( )}k, and the signals {circumflex over ( )}S and Y may be output to the determination unit 133. In step S133, the determination unit 133 determines the input signal ˜S by using the information received from the decision unit 132 and Expressions (1) to (3) described above.
The learning method of the switching model unit 12 in the embodiment of the present invention is performed using the switching model learning device illustrated in the accompanying drawing.
In step S21, the switching model unit 21 receives the observation signal and the enhancement signal for learning, constructs a basic configuration of the switching model, and outputs this model (the switching model being learned) to the optimization unit 22.
In step S22, the optimization unit 22 receives the model from the switching model unit 21 and the switching label created by the switching label creation device 3 described later, optimizes the parameters of the model, and returns the parameters to the switching model unit 21. The model construction by the switching model unit 21 and the parameter optimization by the optimization unit 22 may be repeated in a loop until the optimization is completed. In any case, when the optimization is completed and the parameters are determined, they are reflected in the switching model unit 21, and the switching model is completed.
A specific method of optimization by the optimization unit 22 is as follows. The optimization unit 22 calculates a loss function between the switching label k generated by the switching label creation device 3 to be described later and the output value {circumflex over ( )}k calculated by the switching model unit 21, and optimizes the model parameters included in the switching model unit 21 by minimizing the loss function.
As the loss function, for example, a known cross entropy loss defined by the following expression can be used.
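The cross-entropy loss referred to above can be sketched as the standard binary cross entropy between the switching label k and the estimate {circumflex over ( )}k; this is a plausible reconstruction, not the literal expression of the disclosure:

```python
import math

def bce_loss(k, k_hat, eps=1e-12):
    """Binary cross entropy between the switching label k (0 or 1)
    and the model output k_hat in (0, 1).
    Sketch of the 'known cross entropy loss' named in the text."""
    k_hat = min(max(k_hat, eps), 1.0 - eps)  # clip for numerical safety
    return -(k * math.log(k_hat) + (1 - k) * math.log(1 - k_hat))

# The loss shrinks as the estimate agrees with the label.
assert bce_loss(1, 0.99) < bce_loss(1, 0.5)
assert bce_loss(0, 0.01) < bce_loss(0, 0.5)
```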
Here, the switching model unit 21 (and the switching model unit 12) may additionally estimate the SIR and the SNR of the observation signal simultaneously with the calculation of {circumflex over ( )}k, in order to improve the identification performance of the voice recognition performed by the voice recognition unit 14. SIR is an abbreviation of signal-to-interference ratio, and is the true value of the ratio between the voice of the target speaker and the voice of another speaker. SNR is an abbreviation of signal-to-noise ratio, and is the true value of the ratio between the voice of the target speaker and noise. Since the SIR indicates the ratio between the target speaker signal and the interference speaker signal, it is deeply related to the effect of voice enhancement. The SNR is also closely related to the effect of voice enhancement, because non-voice noise has a small adverse effect on voice recognition yet is relatively difficult to remove by voice enhancement.
Estimates of the SIR and the SNR of the observation signal by the switching model unit 21 are denoted as {circumflex over ( )}SIR and {circumflex over ( )}SNR, respectively. That is, {circumflex over ( )}SIR and {circumflex over ( )}SNR are the values that the switching model unit 21 estimates for the SIR and the SNR of the input observation signal. When the voice of the target speaker is S, the voice of the interference speaker is I, and the noise is N, the SIR and the SNR are defined by the following expressions.
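A common convention, adopted here as an assumption, expresses both ratios in decibels over signal power; the disclosure's exact expressions are not reproduced:

```python
import numpy as np

def sir_db(s, i):
    """Signal-to-interference ratio: target speaker power over
    interference speaker power, in dB (assumed convention)."""
    return 10.0 * np.log10(np.sum(s**2) / np.sum(i**2))

def snr_db(s, n):
    """Signal-to-noise ratio: target speaker power over noise power,
    in dB (assumed convention)."""
    return 10.0 * np.log10(np.sum(s**2) / np.sum(n**2))

# Equal target and interference powers give an SIR of 0 dB.
s = np.array([2.0, 2.0])
i = np.array([2.0, 2.0])
assert abs(sir_db(s, i)) < 1e-9
```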
When the SIR and the SNR of the observation signal are simultaneously estimated, the switching model unit 21 performs learning (hereinafter also referred to as "multi-task learning") that minimizes a loss function obtained by weighting and adding a loss function for the estimation errors of the SIR and the SNR to the loss function for the switching label k. For example, the loss function for the SIR and SNR estimation can be a square error as defined by the following expression.
Here, the multi-task loss function Lmulti is defined by the following expression using the parameters α and β.
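The multi-task loss can be sketched as the switching loss plus squared SIR and SNR estimation errors weighted by α and β (a plausible combination consistent with the description, not the literal expression):

```python
def multitask_loss(l_switch, sir, sir_hat, snr, snr_hat, alpha=1.0, beta=1.0):
    """Sketch of L_multi: the switching-label loss plus weighted
    square errors of the SIR and SNR estimates (assumed form)."""
    return l_switch + alpha * (sir - sir_hat) ** 2 + beta * (snr - snr_hat) ** 2

# With perfect SIR/SNR estimates, only the switching loss remains.
assert multitask_loss(0.3, 10.0, 10.0, 20.0, 20.0) == 0.3
```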
The learning method of the switching model unit 21 has been described above by the processing of the switching model unit 21 and the optimization unit 22. The completed switching model unit 21 is used as the switching model unit 12 in the voice signal processing device 1.
The method of creating a switching label in the embodiment of the present invention is performed using the switching label creation device illustrated in the accompanying drawing.
In step S31, the voice enhancement unit 31 performs voice enhancement processing. That is, the voice enhancement unit 31 acquires an observation signal as an input, extracts only desired voice from the acquired observation signal using a known voice enhancement technology, and performs voice enhancement processing. At this time, as the auxiliary information related to the target speaker, for example, a speech registered in advance by the target speaker or the like can be used. The voice enhancement unit 31 outputs the enhancement signal subjected to the voice enhancement processing to the voice recognition unit 32.
In step S32, the voice recognition unit 32 receives, in addition to the enhancement signal obtained from the voice enhancement unit 31, an observation signal including the voice, noise, or the like of another speaker. The voice recognition unit 32 performs voice recognition processing on each of the received signals, and outputs text information, that is, the voice recognition result corresponding to each voice signal, to the recognition performance calculation unit 33.
In step S33, the recognition performance calculation unit 33 receives a transcription of the target speaker's voice in addition to the voice recognition result corresponding to the enhancement signal received from the voice recognition unit 32 and the voice recognition result for the observation signal. The transcription of the voice of the target speaker corresponds to correct information of a voice signal to be subjected to voice recognition. The recognition performance calculation unit 33 calculates the performance of voice recognition using the two voice recognition results and the transcription. As a method of calculating the voice recognition performance, a known voice recognition performance evaluation criterion such as a character error rate can be used. The recognition performance calculation unit 33 outputs the calculated performance result of voice recognition to the switching label generation unit 34.
In step S34, the switching label generation unit 34 generates the switching label k used as a training label by the optimization unit 22 illustrated in the accompanying drawing.
Here, CERobs denotes the character error rate of the voice recognition result for the observation signal, and CERenh denotes the character error rate of the voice recognition result for the enhancement signal. In the case of the switching label k expressed by Expression (4) described above, when CERobs is lower than CERenh (in other words, when the observation signal yields better voice recognition performance), the switching label k is set to 0 (zero). When CERenh is lower than CERobs (in other words, when the enhancement signal yields better voice recognition performance), the switching label k is set to 1 (one). That is, the switching label k is a binary label of 0 or 1.
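The binary labeling rule above can be sketched as follows; the tie-handling behavior is an assumption, since the text does not specify it:

```python
def switching_label(cer_obs, cer_enh):
    """Return 1 when the enhancement signal has the lower character
    error rate (better recognition), else 0. Ties default to 0
    (an assumption; the disclosure does not specify tie handling)."""
    return 1 if cer_enh < cer_obs else 0

assert switching_label(cer_obs=0.30, cer_enh=0.10) == 1
assert switching_label(cer_obs=0.05, cer_enh=0.20) == 0
```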
The switching label k need not be a binary label and may be determined more flexibly as follows. That is, the voice recognition performance of the observation signal and that of the enhancement signal may be compared, and the label may be calculated on the basis of the performance difference. For example, with a temperature parameter T, the switching label k may be determined more flexibly than the binary label by using the definition formula of the following expression.
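One plausible form of such a temperature-softened label (an assumption consistent with the description, not the literal definition formula) passes the character error rate difference through a sigmoid with temperature T:

```python
import math

def soft_label(cer_obs, cer_enh, temperature=1.0):
    """Soft switching label in (0, 1): approaches 1 when the
    enhancement signal's CER is much lower than the observation's.
    Assumed sigmoid form; the disclosure's formula is not reproduced."""
    return 1.0 / (1.0 + math.exp(-(cer_obs - cer_enh) / temperature))

assert soft_label(0.3, 0.1) > 0.5   # enhancement better -> label above 0.5
assert soft_label(0.1, 0.3) < 0.5   # observation better -> label below 0.5
```

A larger temperature T flattens the label toward 0.5, expressing less confidence in either signal.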
A method of determining the switching label k may be as follows. That is, a weight may be used that maximizes the voice recognition performance when the voice obtained by weighting and averaging the observation signal and the enhancement signal is recognized. As one method for achieving this, the voice recognition unit 32 may obtain a recognition result for the voice in which the observation signal and the enhancement signal are weighted and added at various ratios, the recognition performance calculation unit 33 may calculate the recognition performance for each of them, and the switching label generation unit 34 may use the weight that has achieved the highest recognition performance as the switching label k.
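The weight-search variant can be sketched as a grid search over mixing ratios; `recognition_performance` is a hypothetical stand-in for the combination of the voice recognition unit 32 and the recognition performance calculation unit 33:

```python
import numpy as np

def best_mixing_weight(s_enh, y_obs, recognition_performance, num_steps=11):
    """Try mixing weights k in [0, 1] and return the one whose mixed
    signal k * s_enh + (1 - k) * y_obs scores highest."""
    candidates = np.linspace(0.0, 1.0, num_steps)
    scores = [recognition_performance(k * s_enh + (1.0 - k) * y_obs)
              for k in candidates]
    return candidates[int(np.argmax(scores))]

# Toy stand-in: performance peaks when the mixed signal matches a target.
target = np.array([0.7, 0.7])
perf = lambda sig: -np.sum((sig - target) ** 2)
s_enh, y_obs = np.ones(2), np.zeros(2)
assert abs(best_mixing_weight(s_enh, y_obs, perf) - 0.7) < 1e-9
```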
Through the above processing, pair data for four types of information including an observation signal, auxiliary information related to a target speaker, an enhancement signal, and a switching label is generated.
Evaluation results are as illustrated in the accompanying drawing.
When the hard method of condition (d) and the model with multi-task learning in the present embodiment were used, only the case of SIR=0 and SNR=0 was inferior to the enhancement signal of condition (b), the two cases of SIR=0 with SNR=10 and SNR=20 were equivalent to it, and the remaining six cases were superior to it. On average, the result was 1.9% superior to the result of the enhancement signal of condition (b).
When the soft method of condition (e) and the model with multi-task learning in the present embodiment were used, the two cases of SIR=0 with SNR=10 and SNR=20 were inferior to the enhancement signal of condition (b), no case was equivalent to it, and the remaining seven cases were superior to it. On average, the result was 2.6% superior to the result of the enhancement signal of condition (b).
The performance improvement rate of condition (e) with respect to the result of condition (b) is illustrated in the accompanying drawing.
The voice signal processing method according to the embodiment of the present invention has been described above. By using {circumflex over ( )}k output from the switching model unit 12 to selectively use the enhancement signal and the observation signal, the method of the present embodiment can prevent performance deterioration due to voice enhancement and improve voice recognition performance. In particular, it is possible to appropriately determine whether to perform voice enhancement in cases where the voice enhancement is unnecessary even in a section in which overlap speech occurs, or necessary even in a section in which overlap speech does not occur. The enhancement signal and the observation signal can thus be appropriately switched, and as a result, voice recognition performance can be improved.
In addition, in the model with multi-task learning for estimating the SIR and the SNR described in the present embodiment, higher identification performance can be obtained by considering the SIR and the SNR, which are deeply related to voice enhancement.
Furthermore, by weighting and adding the enhancement signal and the observation signal using {circumflex over ( )}k that is the output of the switching model unit 12, it is possible to determine the input voice in consideration of the uncertainty of the identification model.
Various kinds of processing described above may be executed not only in time series in accordance with the description but also in parallel or individually in accordance with processing abilities of the devices that execute the processing or as necessary. In addition to the above, it is needless to say that appropriate modifications can be made without departing from the scope of the present invention.
The various kinds of processing described above can be performed by causing the recording unit 2020 of the computer 2000 illustrated in the accompanying drawing to store a program describing the processing content and causing the computer to execute the program.
The program in which the processing content is written may be recorded on a computer-readable recording medium. The computer-readable recording medium may be, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
In addition, the program is distributed by, for example, selling, transferring, or renting a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Further, a configuration may also be employed in which the program is stored in a storage device of a server computer and the program is distributed by transferring the program from the server computer to other computers via a network.
For example, a computer that executes such a program first temporarily stores, in its own storage device, a program recorded on a portable recording medium or a program transferred from the server computer. Then, when executing processing, the computer reads the program stored in its recording medium and executes the processing according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to the program, or may sequentially execute processing according to a received program every time the program is transferred from the server computer to the computer. In addition, the above-described processing may be executed by a so-called application service provider (ASP) type service that implements the processing function only by an execution instruction and result acquisition, without transferring the program from the server computer to the computer. Note that the program in this mode includes information provided for processing by an electronic computer that is equivalent to a program (data or the like that is not a direct command to the computer but has properties defining the processing of the computer).
In addition, although the present devices are each configured by executing a predetermined program on a computer in the present embodiments, at least part of the processing content may be implemented by hardware.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2021/045610 | 12/10/2021 | WO |