The present invention relates to a signal processing device, a signal processing method, and a signal processing program.
It is a challenge in speech processing to construct a speech recognition system that is robust against acoustic interference such as background noise and reverberation. Here, it has been confirmed that a multi-channel speech enhancement technology (beamformer) using a plurality of microphones greatly improves speech recognition performance.
Non Patent Literature 1: Szu-Jui Chen, Aswin Shanmugam Subramanian, Hainan Xu, and Shinji Watanabe, “Building state-of-the-art distant speech recognition using the chime-4 challenge with a setup of speech enhancement baseline”, in Interspeech, 2018, pp. 1571-1575.
On the other hand, with single channel speech enhancement technology using a single microphone, even when an enhancement signal from which noise has been removed is used, the speech recognition performance may be lower than that obtained with the noisy observation signal, and the effect on improving speech recognition performance is limited.
In practice, many devices are provided with only one microphone. Accordingly, in order to implement a robust speech recognition system, it is important to develop a speech enhancement technology for a single channel as well as a multi-channel speech enhancement technology.
The present invention has been made in view of the above, and an object thereof is to provide a signal processing device, a signal processing method, and a signal processing program capable of improving speech recognition performance by speech enhancement.
In order to solve the above-described problems and achieve the object, a signal processing device according to the present invention includes: a speech enhancement unit that generates, from an observation signal, an enhancement signal in which a voice of a speaker is enhanced; an addition unit that adds the observation signal to the enhancement signal; and a speech recognition unit that performs speech recognition on the enhancement signal to which the observation signal is added by the addition unit.
According to the present invention, it is possible to improve speech recognition performance by speech enhancement.
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. Note that the present invention is not limited by this embodiment. Furthermore, in the description of the drawings, the same parts are denoted by the same reference numerals. Note that in the following description, when “{circumflex over ( )}A” is used to describe A that is a vector or a matrix, it is assumed that the expression is equivalent to “the symbol in which “{circumflex over ( )}” is provided immediately above “A””. When “
The present embodiment proposes, as an example, a signal processing method for improving speech recognition performance on the basis of an analysis result obtained by analyzing a factor that degrades speech recognition performance of an enhancement signal of single channel speech enhancement (SE). Note that in the present embodiment, a signal processing method for an audio signal (observation signal) recorded by a single microphone (single channel) will be described, but the present invention is not limited to the single channel, and can be applied to audio signals recorded by a plurality of microphones (multi-channel).
First, a factor that degrades speech recognition performance of an enhancement signal of single channel SE was analyzed.
It is often assumed that processing distortion caused by the single channel SE is a cause of deterioration in speech recognition performance. However, such distortion, and particularly its influence on speech recognition, has not been systematically analyzed in detail or resolved. Elucidating the influence of the single channel SE estimation error on speech recognition is considered essential for improving SE front-end design.
Here, the focus is on a single channel SE task. y ∈ R^T indicates the time-domain waveform, of length T, of an observation signal. The observation signal y is modeled as Expression (1). s ∈ R^T represents a sound source signal. n ∈ R^T indicates a background noise signal.
SE aims to reduce the noise signal n in the observation signal y. When the observation signal y is input, an enhancement signal {circumflex over ( )}s ∈ R^T is estimated as {circumflex over ( )}s = SE(y). SE(·) indicates SE processing performed by a neural network, for example.
Subsequently, in order to analyze the influence of an SE estimation error on speech recognition performance, SE estimation error decomposition was examined using orthogonal projection.
Since the enhancement signal {circumflex over ( )}s is acquired by performing estimation processing, it is inevitable that the enhancement signal {circumflex over ( )}s includes an estimation error. The enhancement signal {circumflex over ( )}s is decomposed using orthogonal projection as in Expression (2).
In Expression (2), s_target indicates a target sound source element, e_noise ∈ R^T indicates a noise element (error), and e_artif ∈ R^T indicates an artifact element (error) (see
Specifically, error decomposition by orthogonal projection decomposes an error in SE into a noise element and an artifact element. These two elements are obtained by projecting the SE error onto the subspace spanned by the audio signal and the noise signal (the audio/noise subspace) and onto the subspace orthogonal to the audio/noise subspace, respectively.
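As a non-limiting illustration, the decomposition by orthogonal projection can be sketched as follows. The sketch assumes the commonly used BSS-eval style definitions (target element = projection onto the source subspace, noise element = additional part captured by the source/noise subspace, artifact element = residual), since the expressions themselves are not reproduced here. The function names `delayed_basis`, `project`, and `decompose` and the parameter `num_delays` are hypothetical names introduced only for this example.

```python
import numpy as np

def delayed_basis(x, num_delays):
    """Stack x and its delayed copies (delays 0 .. num_delays-1) as columns."""
    T = len(x)
    cols = []
    for tau in range(num_delays):
        shifted = np.zeros(T)
        shifted[tau:] = x[:T - tau]  # delay by tau samples, zero-padded at the start
        cols.append(shifted)
    return np.stack(cols, axis=1)  # shape (T, num_delays)

def project(basis, x):
    """Orthogonal projection of x onto the column space of basis (least squares)."""
    coeffs, *_ = np.linalg.lstsq(basis, x, rcond=None)
    return basis @ coeffs

def decompose(s_hat, s, n, num_delays=1):
    """Split s_hat into (s_target, e_noise, e_artif); assumed BSS-eval style split."""
    basis_s = delayed_basis(s, num_delays)
    basis_sn = np.concatenate([basis_s, delayed_basis(n, num_delays)], axis=1)
    proj_s = project(basis_s, s_hat)    # part explained by the source alone
    proj_sn = project(basis_sn, s_hat)  # part explained by source and noise
    s_target = proj_s
    e_noise = proj_sn - proj_s
    e_artif = s_hat - proj_sn           # orthogonal to the source/noise subspace
    return s_target, e_noise, e_artif
```

By construction the three returned elements sum to the enhancement signal, and the artifact element is orthogonal to every column of the source/noise basis.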
Since the noise element e_noise includes a linear combination of an audio signal and a noise signal, the noise element e_noise is expected to be a signal that can be observed naturally. Such signals are called natural signals. Since similar noise elements naturally appear in training samples, the influence of natural signals on speech recognition performance may be limited.
On the other hand, the artifact element e_artif includes a signal that cannot be represented by a linear combination of an audio signal and a noise signal (see
As an SE evaluation index, a signal to distortion ratio (SDR) (Expression (3)), a signal to noise ratio (SNR) (Expression (4)), and a signal to artifact ratio (SAR) (Expression (5)) are used.
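For illustration only, the three indices can be computed from the decomposed elements as sketched below. The formulas are an assumption based on the widely used BSS-eval definitions (SDR: target versus total error; SNR: target versus noise element; SAR: target plus noise element versus artifact element), not a reproduction of Expressions (3) to (5). The function names are hypothetical.

```python
import numpy as np

def _ratio_db(numerator, denominator):
    """Energy ratio of two signals in decibels."""
    return 10.0 * np.log10(np.sum(numerator ** 2) / np.sum(denominator ** 2))

def sdr(s_target, e_noise, e_artif):
    """Signal to distortion ratio (assumed BSS-eval form)."""
    return _ratio_db(s_target, e_noise + e_artif)

def snr(s_target, e_noise):
    """Signal to noise ratio (assumed BSS-eval form)."""
    return _ratio_db(s_target, e_noise)

def sar(s_target, e_noise, e_artif):
    """Signal to artifact ratio (assumed BSS-eval form)."""
    return _ratio_db(s_target + e_noise, e_artif)
```

Under these assumed definitions, the SAR depends only on how much energy lies outside the source/noise subspace relative to the energy inside it.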
Next, an experiment was conducted to examine the influence of each error element, in particular the artifact element e_artif, on speech recognition performance. In the experiment, in order to measure the influence of the artifact element e_artif and the noise element e_noise on the speech recognition performance, the enhancement signal was modified by changing the magnitude of each error element, and speech recognition was performed using the modified enhancement signal as an input.
Specifically, after decomposing the enhancement signal {circumflex over ( )}s using orthogonal projection, an enhancement signal {circumflex over ( )}s_ω ∈ R^T was synthesized by increasing or decreasing the artifact element e_artif and the noise element e_noise as in Expression (6).
Here, ω_noise is a parameter that controls the quantity of the noise element e_noise, and ω_artif is a parameter that controls the quantity of the artifact element e_artif. In this experiment, the values of ω_noise and ω_artif are changed in order to obtain various enhancement signals {circumflex over ( )}s_ω having different ratios of noise elements and artifact elements. As a result, it is possible to hold the same target sound source element s_target while controlling the values of SNR and SAR. By inputting such a modified enhancement signal to the speech recognition system as an evaluation enhancement signal, the influence of each error element on the speech recognition performance was directly measured.
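The resynthesis described above can be sketched as follows, assuming that the modified signal is the target element plus the two error elements scaled by their respective parameters. The function names `resynthesize` and `sweep` are hypothetical and are introduced only for this example.

```python
import numpy as np

def resynthesize(s_target, e_noise, e_artif, w_noise=1.0, w_artif=1.0):
    """Rebuild an enhancement signal with rescaled noise and artifact elements."""
    return s_target + w_noise * e_noise + w_artif * e_artif

def sweep(s_target, e_noise, e_artif, weights):
    """Generate one modified signal per (w_noise, w_artif) pair in the grid."""
    return {
        (w_noise, w_artif): resynthesize(s_target, e_noise, e_artif, w_noise, w_artif)
        for w_noise in weights
        for w_artif in weights
    }
```

With both parameters equal to 1 the original enhancement signal is recovered, and the target element is the same in every signal of the grid.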
As illustrated in
Therefore, based on this finding, the present embodiment proposes a signal processing method for improving speech recognition performance. In the present embodiment, as an approach for reducing the influence of the artifact element, a method for reducing the ratio of the artifact element in a signal input to the speech recognition system has been studied.
In the present embodiment, the original sound (observation signal) is added to the enhancement signal to reduce the ratio of the artifact element in the signal input to the speech recognition system. Specifically, a signal obtained by adding the scaled observation signal y to the enhancement signal {circumflex over ( )}s is input to the speech recognition system as a modified enhancement signal
Here, ω_obs ≥ 0 is a parameter for controlling the quantity of the observation signal y added to the enhancement signal {circumflex over ( )}s.
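As a minimal sketch, the original sound addition amounts to adding the observation signal, scaled by the non-negative parameter, to the enhancement signal sample by sample. The function name `add_original_sound` and the input checks are assumptions made for this example.

```python
import numpy as np

def add_original_sound(s_hat, y, w_obs):
    """Return the modified enhancement signal s_hat + w_obs * y."""
    s_hat = np.asarray(s_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    if w_obs < 0:
        raise ValueError("w_obs must be non-negative")
    if s_hat.shape != y.shape:
        raise ValueError("s_hat and y must have the same shape")
    return s_hat + w_obs * y
```

Setting the parameter to 0 returns the enhancement signal unchanged, so the conventional front end is a special case of this processing.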
On the other hand, since the observation signal y is added to the enhancement signal {circumflex over ( )}s in the modified enhancement signal
A SAR improvement value SARi is calculated as in Equation (8). If SARi > 0, the ratio of the artifact element e_artif decreases when the original sound addition is performed. Note that P_s ∈ R^(T×T) indicates the orthogonal projection matrix onto the subspace spanned by the delayed sound source signals {s^τ}, τ = 0, ..., L−1 (L−1 is the maximum allowable number of delays). P_s,n ∈ R^(T×T) indicates the orthogonal projection matrix onto the subspace spanned by the delayed sound source signals and noise signals {s^τ, n^τ}, τ = 0, ..., L−1.
In the equation in the second column in Equation (8), P_s,n y = y and
A signal processing device to which original sound addition is applied for improving speech recognition performance will be described.
A signal processing device 10 according to the embodiment is implemented by, for example, a predetermined program being read by a computer or the like including a read only memory (ROM), a random access memory (RAM), a central processing unit (CPU), and the like, and the CPU executing the predetermined program. Furthermore, the signal processing device 10 includes a communication interface that transmits and receives various types of information to and from another device connected via a network or the like. As illustrated in
The speech enhancement unit 11 receives input of the observation signal y recorded in a single channel. For the purpose of reducing the noise signal n from the observation signal y, the speech enhancement unit 11 generates the enhancement signal {circumflex over ( )}s in which the voice of the speaker is enhanced from the observation signal y. The speech enhancement unit 11 performs speech enhancement processing using, for example, a neural network.
The original sound addition unit 12 adds the observation signal y (original sound) to the enhancement signal {circumflex over ( )}s. The original sound addition unit 12 inputs a signal obtained by adding the weighted observation signal y to the enhancement signal {circumflex over ( )}s to the speech recognition unit 13 as the modified enhancement signal
Note that the original sound addition unit 12 adjusts a weight ω_obs of the observation signal y to be added to the enhancement signal {circumflex over ( )}s according to the ratio of the noise signal included in the observation signal y. For example, in a case where the ratio of the noise signal included in the observation signal y is lower than a certain value, the original sound addition unit 12 may set the value of the weight ω_obs lower than the prescribed value. Furthermore, in a case where the ratio of the noise signal included in the observation signal y is higher than the certain value, the original sound addition unit 12 may set the value of the weight ω_obs higher than the prescribed value. The original sound addition unit 12 may estimate the SNR of the observation signal y and determine the value of the weight ω_obs on the basis of the estimation result.
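One possible way to realize such an adjustment is sketched below. The sketch follows the rule stated above (a low noise ratio, that is, a high estimated SNR, gives a weight below the prescribed value, and a high noise ratio gives a weight above it). The function name `choose_w_obs`, the threshold, and all weight values are hypothetical placeholders, and the SNR estimate is assumed to be supplied by a separate estimator that is not shown.

```python
def choose_w_obs(estimated_snr_db, snr_threshold_db=10.0,
                 w_prescribed=0.5, w_low=0.25, w_high=1.0):
    """Pick the observation weight from an estimated SNR (all values hypothetical)."""
    if estimated_snr_db > snr_threshold_db:
        # Noise ratio lower than the certain value: weight below the prescribed value.
        return w_low
    if estimated_snr_db < snr_threshold_db:
        # Noise ratio higher than the certain value: weight above the prescribed value.
        return w_high
    return w_prescribed
```

A continuous mapping from the estimated SNR to the weight could be used instead of a single threshold; the step rule here is only the simplest instance.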
Furthermore, the original sound addition unit 12 may weight both the observation signal y and the enhancement signal {circumflex over ( )}s to which the observation signal y is added, such that the sum of the weight of the observation signal y and the weight of the enhancement signal {circumflex over ( )}s is 1, as shown in Expression (10).
Furthermore, the original sound addition unit 12 may appropriately set a weight α of the observation signal y and a weight β of the enhancement signal {circumflex over ( )}s to which the observation signal y is added, as shown in Expression (11).
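The two weighting variants can be sketched as follows, assuming that one weight applies to the enhancement signal and the other to the observation signal. The function names and the parameter names `w`, `w_enh`, and `w_obs` are hypothetical, and no particular assignment of the symbols α and β to these parameters is implied.

```python
import numpy as np

def mix_sum_to_one(s_hat, y, w):
    """Weighted sum whose two weights add up to 1: (1 - w) * s_hat + w * y."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return (1.0 - w) * np.asarray(s_hat, dtype=float) + w * np.asarray(y, dtype=float)

def mix_independent(s_hat, y, w_enh, w_obs):
    """Weighted sum with two independently chosen weights."""
    return w_enh * np.asarray(s_hat, dtype=float) + w_obs * np.asarray(y, dtype=float)
```

The first variant keeps the overall level of the mixed signal bounded by the levels of its two inputs, which may be convenient when the recognizer is sensitive to input gain.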
The speech recognition unit 13 performs speech recognition on the modified enhancement signal
Next, a signal processing method executed by the signal processing device 10 will be described.
As illustrated in
The speech recognition accuracy of the signal processing device 10 was evaluated experimentally. A neural network-based time-domain denoising network (Denoising-TasNet) was adopted as the speech enhancement unit 11. A deep neural network hidden Markov model (DNN-HMM) hybrid automatic speech recognition (ASR) system based on the standard Kaldi recipe was adopted as the speech recognition unit 13. A data set of reproduced reverberant noisy audio signals was generated using the Wall Street Journal (WSJ0) corpus as the audio source and the CHiME-3 corpus as the noise source, and was used as a training set, a development set, and an evaluation set.
As illustrated in
Accordingly, the signal processing device 10 was able to improve the performance of speech recognition as compared with the reference observation signal and the original enhancement signal {circumflex over ( )}s by performing the original sound addition. In other words, the signal processing device 10 was able to improve the single channel SE front-end speech recognition performance by reducing the ratio of the artifact element in the modified enhancement signal
Subsequently, the actual recording was evaluated. Actual recorded audio data (et05_real) of the CHiME-3 dataset was used to confirm the results of the actual recording.
As illustrated in
As described above, the signal processing device 10 according to the embodiment adds the observation signal y to the enhancement signal {circumflex over ( )}s and inputs the resulting signal to the speech recognition unit 13 in order to reduce the influence of the artifact element on the speech recognition performance. As a result, it has been demonstrated that the signal processing device 10 can monotonically increase the SAR value and improve the speech recognition performance. Furthermore, it has been found that the signal processing device 10 effectively improves the speech recognition performance even in actual recording.
Conventionally, it has been difficult to improve speech recognition performance particularly in single channel speech enhancement. Furthermore, there has been no configuration in which original sound addition is performed as a front end of speech recognition.
The signal processing device 10 according to the present embodiment has succeeded in improving the speech recognition performance in single channel speech enhancement merely by inserting, at the stage preceding speech recognition, the simple processing of adding the original sound (observation signal) to the enhancement signal.
Each component of the signal processing device 10 is functionally conceptual, and does not necessarily have to be physically configured as illustrated. That is, specific forms of distribution and integration of the functions of the signal processing device 10 are not limited to the illustrated forms, and all or some of them can be functionally or physically distributed or integrated in any unit according to various loads, usage conditions, and the like.
Furthermore, all or any part of the processing performed in the signal processing device 10 may be implemented by a CPU, a graphics processing unit (GPU), and a program analyzed and executed by the CPU and the GPU. Furthermore, the processing performed in the signal processing device 10 may be implemented as hardware by wired logic.
Furthermore, among the processing described in the embodiment, all or part of the processing described as being automatically performed can be performed manually. Alternatively, all or part of the processing described as being manually performed can be automatically performed by a known method. In addition, the above-described and illustrated processing procedures, control procedures, specific names, and information including various data and parameters can be appropriately changed unless otherwise specified.
The memory 1010 includes a ROM 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a basic input output system (BIOS). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example. The video adapter 1060 is connected to, for example, a display 1130.
The hard disk drive 1090 stores, for example, an operating system (OS) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines the processing of the signal processing device 10 is implemented as the program module 1093 in which a code executable by the computer 1000 is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing processing similar to the functional configurations in the signal processing device 10 is stored in the hard disk drive 1090. Note that the hard disk drive 1090 may be replaced with a solid state drive (SSD).
Furthermore, setting data used in the processing of the above-described embodiment is stored as the program data 1094, for example, in the memory 1010 or the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary and executes the program module 1093 and the program data 1094.
The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090, and may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (local area network (LAN), wide area network (WAN), or the like). The program module 1093 and the program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.
While the embodiment to which the invention made by the present inventors is applied has been described above, the present invention is not limited by the description and drawings included as a part of the disclosure of the present invention according to the present embodiment. That is, other embodiments, examples, operation technologies, and the like made by those skilled in the art or the like on the basis of the present embodiment are all included in the scope of the present invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2021/044564 | 12/3/2021 | WO |