This patent application claims the benefit and priority of Chinese Patent Application No. 202110913351.1 filed on Aug. 10, 2021, the disclosure of which is incorporated herein by reference in its entirety.
The present disclosure relates to the technical field of speech recognition, and in particular, to a multimodal speech recognition method and system, and a computer-readable storage medium.
Speech interaction plays a vital role in intelligent interaction scenarios, such as smart households. Speech interaction provides non-contact man-machine interaction between people and Internet of Things (IoT) devices. Benefiting from the development of deep learning and natural language processing, an automatic speech recognition technology enables speech-interactive devices to accurately obtain what users speak. In recent years, commercial speech interaction products have become increasingly popular, such as smart speakers (for example, Amazon Echo and Google Home), voice assistants (for example, Siri) in smartphones, and in-vehicle speech control interaction (for example, speech interaction in Tesla Model S, X, 3, and Y).
In addition to home scenarios, today's speech interaction needs to cope with more diverse ambient noise (for example, traffic noise, commercial noise, and nearby sounds) in public places (for example, streets, stations, halls, or gatherings). However, a speech recognition technology based on microphone arrays and audio requires that the audio signal have a high signal-to-noise ratio and be clear. In a noisy environment, an audio signal drowned in unpredictable noise therefore becomes difficult to recognize. In addition, speech quality deteriorates as the recognition distance increases, which degrades recognition accuracy. To resolve these difficulties, researchers have turned to multi-sensor information fusion for speech enhancement and recognition. For example, an audio-visual method combines lip motion captured by a camera with noisy sounds, but is limited by lighting conditions, line-of-sight requirements, and coverings. An ultrasound-assisted speech enhancement technology, for its part, has an extremely short working distance (below 20 cm) and requires a specific posture.
The present disclosure provides a multimodal speech recognition method and system, and a computer-readable storage medium to implement high-accuracy speech recognition.
To implement the foregoing objective, the present disclosure provides the following solutions:
A multimodal speech recognition method includes:
Optionally, obtaining the target millimeter-wave signal and the target audio signal, may specifically include:
Optionally, calculating the first logarithmic mel-frequency spectral coefficient and the second logarithmic mel-frequency spectral coefficient when the target millimeter-wave signal and the target audio signal both contain the speech information corresponding to the target user, may specifically include:
Optionally, determining whether the target millimeter-wave signal and the target audio signal both contain the speech information to obtain the first determining result, may specifically include:
Optionally, determining whether the target millimeter-wave signal and the target audio signal both come from the target user, may specifically include:
Optionally, the fusion network may further include two identical branch networks, namely, a first branch network and a second branch network; and each branch network may include a first residual block with efficient channel attention (ResECA), a second ResECA, a third ResECA, a fourth ResECA, and a fifth ResECA; where
Optionally, the feature calibration performed by the calibration module may specifically include:
Optionally, the fusion performed by the mapping module may specifically include:
A multimodal speech recognition system includes:
A computer-readable storage medium is provided. The computer-readable storage medium stores a computer program. The computer program, when executed by a processor, implements the steps of the multimodal speech recognition method.
Based on specific embodiments provided in the present disclosure, the present disclosure has the following technical effects:
Considering that the millimeter-wave signal is not affected by noise and can perceive throat vibration information while a user is speaking, when the audio signal is polluted by noise the present disclosure uses the fusion network to perform mutual feature calibration and fusion on the millimeter-wave signal and the audio signal. That is, the millimeter-wave feature and the audio feature calibrate each other, and the vibration information in the millimeter-wave signal is integrated into the audio feature to obtain the target fusion feature, which in turn guides the semantic feature network to capture the semantic information in the target fusion feature with high accuracy.
To describe the technical solutions in embodiments of the present disclosure or in the prior art more clearly, the accompanying drawings required in the embodiments will be briefly described below. Apparently, the accompanying drawings in the following description show merely some embodiments of the present disclosure, and other drawings can be derived from these accompanying drawings by those of ordinary skill in the art without creative efforts.
The technical solutions of the embodiments of the present disclosure are clearly and completely described below with reference to the accompanying drawings. Apparently, the described embodiments are merely some rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative efforts shall fall within the protection scope of the present disclosure.
To make the foregoing objective, features, and advantages of the present disclosure clearer and more comprehensible, the present disclosure will be further described in detail below with reference to the accompanying drawings and specific embodiments.
Terms in the present disclosure and their abbreviations:
Research has shown that millimeter-wave signals offer excellent resistance to ambient noise and strong penetration, and are helpful for speech information recovery. Based on the problems raised in the background art, the present disclosure uses a millimeter-wave radar as a supplement to speech recognition. The millimeter-wave radar can perceive a remote target user; even if the user wears a mask in a noisy environment, the reflected signal received by the millimeter-wave radar still contains throat vibration information. However, the performance of the millimeter-wave radar is not always satisfactory. A millimeter-wave signal has an extremely short wavelength (about 4 mm) and is very sensitive to both vocal vibration and motion vibration, so in practice the millimeter-wave signal is affected by body motion of the user. Fortunately, microphone-based speech acquisition can make up for this information loss to some extent. Therefore, the present disclosure considers complementary cooperation between the millimeter-wave radar and the microphone, and combines the two different modal signals for speech recognition. To be specific, the millimeter-wave signal supports anti-noise speech perception, and the audio signal acquired by the microphone serves as a guide for calibrating the millimeter-wave feature under motion interference.
In view of this, the present disclosure provides a multimodal speech recognition method and system that fuse a millimeter-wave signal and an audio signal. First, speech activity detection is performed and a user is determined based on a correlation between a millimeter-wave signal and an audio signal to obtain a millimeter-wave signal and an audio signal corresponding to the user. Then, the millimeter-wave signal and audio signal are input into a fusion network for full fusion to obtain a fusion feature. Finally, the fusion feature is input into a semantic extraction network to obtain semantic text, namely, a speech recognition result. The present disclosure integrates and enhances advantages of the millimeter-wave signal and audio signal, and implements high-accuracy speech recognition under harsh conditions such as high noise, a long distance, and a plurality of angles.
Referring to
In a preferred implementation, the step 10 may specifically include:
In a preferred implementation, the step 20 may specifically include:
Further, determining whether the target millimeter-wave signal and the target audio signal both contain speech information to obtain the first determining result may specifically include:
Further, determining whether the target millimeter-wave signal and the target audio signal both come from the target user, may specifically include:
In a preferred implementation, the fusion network may further include two identical branch networks, namely, a first branch network and a second branch network. Each branch network may include a first ResECA, a second ResECA, a third ResECA, a fourth ResECA, and a fifth ResECA.
In a preferred implementation, as shown in
An input end of the first ResECA of the first branch network is used to input the first logarithmic mel-frequency spectral coefficient. An output end of the first ResECA of the first branch network is connected to an input end of the second ResECA of the first branch network. An output end of the second ResECA of the first branch network is connected to an input end of the third ResECA of the first branch network. An output end of the fourth ResECA of the first branch network is connected to an input end of the fifth ResECA of the first branch network.
An input end of the first ResECA of the second branch network is used to input the second logarithmic mel-frequency spectral coefficient. An output end of the first ResECA of the second branch network is connected to an input end of the second ResECA of the second branch network. An output end of the second ResECA of the second branch network is connected to an input end of the third ResECA of the second branch network. An output end of the fourth ResECA of the second branch network is connected to an input end of the fifth ResECA of the second branch network.
Input ends of the mapping module are respectively connected to an output end of the fifth ResECA of the first branch network and an output end of the fifth ResECA of the second branch network.
Further, the feature calibration performed by the calibration module may specifically include:
Further, the fusion performed by the mapping module may specifically include:
As shown in
In step 1, a target user stands about 7 meters away from a millimeter-wave radar and a microphone, and speaks a wakeup word and a speech command. In this case, the millimeter-wave radar acquires a millimeter-wave signal, and the microphone acquires an audio signal.
First, both signals are clipped to a length of 3 seconds, and then normalized and downsampled to 16 kHz. Next, FFT is performed on the downsampled millimeter-wave signal to extract a millimeter-wave phase signal, and a difference operation is performed on the millimeter-wave phase signal to extract a millimeter-wave phase difference signal. The downsampled audio signal and the millimeter-wave phase difference signal are multiplied to obtain a product component. Next, it is determined whether the millimeter-wave signal or the audio signal contains speech information.
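Before that determination, the preprocessing just described can be sketched as follows, assuming the millimeter-wave return has already been demodulated to a real-valued sample stream; the frame length and the dominant-bin phase extraction are illustrative assumptions, since the disclosure only states that an FFT and a difference operation are applied.

```python
import numpy as np
from scipy.signal import resample_poly

FS_TARGET = 16_000   # target sampling rate (Hz)
CLIP_SECONDS = 3     # fixed clip length used in step 1

def clip_and_normalize(x, fs):
    """Clip to 3 s, remove the mean, peak-normalize, and resample to 16 kHz."""
    x = np.asarray(x, dtype=np.float64)[: int(CLIP_SECONDS * fs)]
    x = (x - np.mean(x)) / (np.max(np.abs(x)) + 1e-8)
    return resample_poly(x, FS_TARGET, fs)

def mmwave_phase_difference(mm, frame=256):
    """Per-frame FFT of the millimeter-wave signal, phase of the dominant bin,
    then a first-order difference; interpolated back to the sample rate so it
    can be multiplied with the audio signal."""
    n_frames = len(mm) // frame
    phase = np.empty(n_frames)
    for i in range(n_frames):
        spec = np.fft.fft(mm[i * frame:(i + 1) * frame])
        k = np.argmax(np.abs(spec[1:frame // 2])) + 1   # dominant non-DC bin
        phase[i] = np.angle(spec[k])
    diff = np.diff(np.unwrap(phase), prepend=0.0)        # phase-difference signal
    centers = np.arange(n_frames) * frame + frame // 2
    return np.interp(np.arange(len(mm)), centers, diff)

def product_component(mm_raw, audio_raw, fs_mm, fs_audio):
    """Element-wise product of the downsampled audio and the phase difference."""
    mm = clip_and_normalize(mm_raw, fs_mm)
    au = clip_and_normalize(audio_raw, fs_audio)
    pd = mmwave_phase_difference(mm)
    n = min(len(au), len(pd))
    return au[:n] * pd[:n]
```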
A spectral entropy of the product component is calculated. If the spectral entropy is greater than a specified threshold, here 0.83, it indicates that the millimeter-wave signal and the audio signal both contain speech information; otherwise, the millimeter-wave signal or the audio signal does not perceive speech information. Step 2 is then performed on a millimeter-wave signal and an audio signal that are found to contain speech information, to determine whether they both come from the target user rather than from interference by other people.
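The spectral-entropy check can be sketched as follows; the 0.83 threshold is the value stated above, the FFT size is an illustrative assumption, and the decision rule (entropy above the threshold means speech is present) follows the disclosure.

```python
import numpy as np

def spectral_entropy(x, n_fft=512):
    """Normalized spectral entropy of the product component, in [0, 1]."""
    spec = np.abs(np.fft.rfft(x, n=n_fft)) ** 2
    p = spec / (np.sum(spec) + 1e-12)          # power spectrum as a probability distribution
    h = -np.sum(p * np.log2(p + 1e-12))
    return h / np.log2(len(p))                 # divide by the maximum possible entropy

def contains_speech(product, threshold=0.83):
    """Step-1 decision: both signals are taken to contain speech if the entropy exceeds 0.83."""
    return spectral_entropy(product) > threshold
```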
In step 2, an LPC component is extracted from the product component and input into a trained OC-SVM to determine whether the millimeter-wave signal and the audio signal both come from the target user. If the LPC component comes from the target user, step 3 is performed; otherwise, steps 1 and 2 are repeated. The trained OC-SVM is obtained in advance by training on a millimeter-wave signal and an audio signal corresponding to a calibration user.
The training may include: the calibration user speaks the wakeup word to the millimeter-wave radar and the microphone 30 times; the preprocessing of step 1 is performed on the acquired millimeter-wave signal and audio signal to obtain a calibration product component; and a calibration LPC component extracted from the calibration product component, together with the identity of the calibration user, is used to train an OC-SVM so that it can determine whether a given LPC component comes from the target user.
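A sketch of this verification step, using librosa's LPC routine and scikit-learn's OneClassSVM as stand-ins for the unspecified implementations; the LPC order and the SVM hyperparameters are illustrative assumptions, not values from the disclosure.

```python
import numpy as np
import librosa
from sklearn.svm import OneClassSVM

LPC_ORDER = 16   # illustrative prediction order

def lpc_feature(product):
    """LPC coefficients of a product component (the leading 1 is dropped)."""
    return librosa.lpc(np.asarray(product, dtype=np.float64), order=LPC_ORDER)[1:]

def train_oc_svm(calibration_products):
    """Fit a one-class SVM on the ~30 wakeup-word recordings of the calibration user."""
    feats = np.stack([lpc_feature(p) for p in calibration_products])
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
    return model.fit(feats)

def is_target_user(model, product):
    """+1 from the OC-SVM means the signals are attributed to the enrolled user."""
    return model.predict(lpc_feature(product).reshape(1, -1))[0] == 1
```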
In step 3, STFT is performed on the millimeter-wave signal and the audio signal containing the speech information of the user; then, logarithmic mel-frequency spectral coefficients of the millimeter-wave signal and the audio signal obtained after the STFT are respectively calculated; and finally, the logarithmic mel-frequency spectral coefficients are input into a fusion network to obtain a fusion feature.
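The feature extraction can be sketched with librosa as follows; the 400-sample window, 160-sample hop, and 80 mel bins are common 16 kHz settings assumed here rather than values given in the disclosure.

```python
import numpy as np
import librosa

def log_mel(signal, fs=16_000, n_fft=400, hop=160, n_mels=80):
    """STFT followed by mel filtering and a log, i.e. the logarithmic
    mel-frequency spectral coefficients fed to one branch of the fusion network."""
    mel = librosa.feature.melspectrogram(
        y=np.asarray(signal, dtype=np.float32), sr=fs,
        n_fft=n_fft, hop_length=hop, n_mels=n_mels, power=2.0)
    return np.log(mel + 1e-6)                  # shape: (n_mels, n_frames)

# lms_mm  = log_mel(mm_speech)    # input to the millimeter-wave branch (hypothetical variable)
# lms_aud = log_mel(audio_speech) # input to the audio branch (hypothetical variable)
```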
The fusion network includes two branch networks, which respectively receive the logarithmic mel-frequency spectral coefficients of the millimeter-wave signal and the audio signal. Each branch network is composed of five ResECAs. The fusion network further includes two modules. One is a calibration module configured to calibrate the two input features; it is located after the third ResECA, and the output of the third ResECA passes through the calibration module and then flows into the fourth ResECA. The other is a mapping module, which maps the two features into the same feature space to obtain the final fusion feature; it is located after the fifth ResECA and receives the millimeter-wave feature and the audio feature from the two branch networks, respectively.
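A PyTorch sketch of one such residual block with efficient channel attention (following the ECA-Net idea of a 1-D convolution over pooled channel descriptors) and of a five-block branch is given below; the channel widths, strides, and kernel sizes are assumptions, since the disclosure does not specify them.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: global average pool -> 1-D conv across channels -> sigmoid gate."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                               # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                          # (B, C) channel descriptors
        y = self.conv(y.unsqueeze(1)).squeeze(1)        # 1-D conv over the channel axis
        return x * torch.sigmoid(y)[:, :, None, None]   # re-weight the channels

class ResECA(nn.Module):
    """Residual block with an ECA gate on its main path; widths and strides are assumptions."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, 1, 1, bias=False),
            nn.BatchNorm2d(c_out), ECA())
        self.skip = (nn.Identity() if c_in == c_out and stride == 1
                     else nn.Conv2d(c_in, c_out, 1, stride, bias=False))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

def make_branch():
    """Five ResECAs for one branch; the channel plan 16/32/64 is illustrative."""
    return nn.ModuleList([ResECA(1, 16), ResECA(16, 32, 2), ResECA(32, 32),
                          ResECA(32, 64, 2), ResECA(64, 64)])
```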
A mathematical principle of the calibration module is as follows. $X_W\in\mathbb{R}^{H\times W\times C}$ and $X_S\in\mathbb{R}^{H\times W\times C}$ are two intermediate features from the respective branch networks, where $\mathbb{R}$ denotes the real number domain, $H$ the width, $W$ the length, and $C$ the size of the channel dimension, and the subscripts $W$ and $S$ respectively denote the millimeter-wave signal and the audio signal. Channel feature distributions $Y_W$ and $Y_S$ of the two intermediate features are calculated as:
$Y_W=\sigma\big(W_W\,\mathrm{ReLU}(\mathrm{GAP}(X_W))\big),\quad Y_W\in\mathbb{R}^{1\times 1\times C}$  (1)
$Y_S=\sigma\big(W_S\,\mathrm{ReLU}(\mathrm{GAP}(X_S))\big),\quad Y_S\in\mathbb{R}^{1\times 1\times C}$  (2)
where $\mathrm{ReLU}$ denotes the ReLU function, $W_W$ and $W_S$ are learnable parameter matrices, $\sigma$ is the sigmoid function, and $\mathrm{GAP}$ is the global average pooling function. The channel feature distributions $Y_W$ and $Y_S$ can be regarded as feature detectors and filters. Mutual feature calibration is implemented by using formulas (3) and (4):
$\tilde{X}_W=Y_S\odot X_W+X_W,\quad \tilde{X}_W\in\mathbb{R}^{H\times W\times C}$  (3)
$\tilde{X}_S=Y_W\odot X_S+X_S,\quad \tilde{X}_S\in\mathbb{R}^{H\times W\times C}$  (4)
where $\odot$ denotes element-wise multiplication, and $\tilde{X}_W$ and $\tilde{X}_S$ respectively represent the final calibrated millimeter-wave feature and audio feature. Because the two features are correlated, that is, both contain the speech information of the user, the mutual calibration can enhance important information and suppress irrelevant interference information in the respective feature maps.
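A PyTorch sketch of this calibration module is given below; realizing the parameter matrices $W_W$ and $W_S$ as 1x1 convolutions on the pooled channel descriptors is an assumption about the implementation, chosen so that the code matches formulas (1) to (4) dimension for dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CalibrationModule(nn.Module):
    """Mutual feature calibration between the two branches, formulas (1)-(4)."""
    def __init__(self, channels: int):
        super().__init__()
        # W_W and W_S of formulas (1)-(2), realized here as 1x1 convolutions (assumption)
        self.w_w = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.w_s = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x_w, x_s):                                              # both: (B, C, H, W)
        y_w = torch.sigmoid(self.w_w(F.relu(F.adaptive_avg_pool2d(x_w, 1))))  # formula (1)
        y_s = torch.sigmoid(self.w_s(F.relu(F.adaptive_avg_pool2d(x_s, 1))))  # formula (2)
        x_w_cal = y_s * x_w + x_w   # formula (3): the audio gate calibrates the mmWave feature
        x_s_cal = y_w * x_s + x_s   # formula (4): the mmWave gate calibrates the audio feature
        return x_w_cal, x_s_cal
```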
To map the two features from different feature spaces, namely the millimeter-wave feature and the audio feature, into the same feature space, the mapping module is designed and inserted at the end of the fusion network to generate the final fusion feature. Assuming that $M\in\mathbb{R}^{H\times W\times C}$ and $V\in\mathbb{R}^{H\times W\times C}$ are the millimeter-wave feature and the audio feature from the respective branch networks, $M$ and $V$ are flattened into two-dimensional variables of size $\mathbb{R}^{C\times HW}$. A similarity matrix of $M$ and $V$ is calculated as:
$S=M^{\mathsf{T}}W_{MV}V,\quad S\in\mathbb{R}^{HW\times HW}$  (5)
where $W_{MV}$ is a learnable parameter matrix, and each element of $S$ reveals the correlation between the corresponding columns of $M$ and $V$. Softmax normalization is performed on the similarity matrix and on its transpose as follows:
$S_M=\mathrm{softmax}(S),\quad S_M\in\mathbb{R}^{HW\times HW}$  (6)
$S_V=\mathrm{softmax}(S^{\mathsf{T}}),\quad S_V\in\mathbb{R}^{HW\times HW}$  (7)
The similarity matrix $S_M$ can convert the millimeter-wave feature space into the audio feature space; similarly, $S_V$ can convert the audio feature space into the millimeter-wave feature space. The corresponding attention features are calculated as follows:
$C_M=V\otimes S_M,\quad C_M\in\mathbb{R}^{C\times HW}$  (8)
$C_V=M\otimes S_V,\quad C_V\in\mathbb{R}^{C\times HW}$  (9)
where $\otimes$ represents matrix multiplication. Finally, the final fusion feature $Z$ is obtained from the two attention features:
$Z=W_Z\big(\sigma(C_M)\odot M+\sigma(C_V)\odot V\big),\quad Z\in\mathbb{R}^{C\times HW}$  (10)
where $W_Z$ is a learnable parameter matrix. The fusion feature $Z$ selectively integrates information from the two modal features, and fine-grained elements related to speech vibration and acoustic characteristics are dominant in $Z$. The final fusion feature output by the fusion network is input into the semantic extraction network for speech recognition.
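A PyTorch sketch of the mapping module defined by formulas (5) to (10); realizing $W_{MV}$ and $W_Z$ as $C\times C$ learnable matrices, and taking the softmax along the last dimension, are assumptions that make the shapes in the formulas work out.

```python
import torch
import torch.nn as nn

class MappingModule(nn.Module):
    """Maps the calibrated mmWave and audio features into one space, formulas (5)-(10)."""
    def __init__(self, channels: int):
        super().__init__()
        # W_MV of formula (5) and W_Z of formula (10) as C x C learnable matrices (assumption)
        self.w_mv = nn.Parameter(torch.randn(channels, channels) * 0.02)
        self.w_z = nn.Parameter(torch.randn(channels, channels) * 0.02)

    def forward(self, m, v):                            # m, v: (B, C, H, W)
        m = m.flatten(2)                                # (B, C, HW)
        v = v.flatten(2)
        s = m.transpose(1, 2) @ self.w_mv @ v           # formula (5): (B, HW, HW)
        s_m = torch.softmax(s, dim=-1)                  # formula (6)
        s_v = torch.softmax(s.transpose(1, 2), dim=-1)  # formula (7)
        c_m = v @ s_m                                   # formula (8): (B, C, HW)
        c_v = m @ s_v                                   # formula (9): (B, C, HW)
        z = self.w_z @ (torch.sigmoid(c_m) * m + torch.sigmoid(c_v) * v)  # formula (10)
        return z                                        # fused feature Z, (B, C, HW)
```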
In step 4, the final fusion feature is input into a semantic feature network to obtain semantic text, namely, the speech recognition result. The semantic feature network in this method is the classic LAS (Listen, Attend and Spell) model, which consists of two components: an encoder called the Listener and a decoder called the Speller. The Listener uses a pyramidal bidirectional LSTM (pBLSTM) to map the fusion feature to a hidden feature. The Speller is a stacked recurrent neural network that computes the probability of the output character sequence and uses a multi-head attention mechanism to generate context vectors. In the LAS used here, the Listener includes two consecutive pBLSTM layers, and the Speller includes two LSTM layers and an output Softmax layer. After receiving the fusion feature from step 3, the LAS outputs the speech recognition result.
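A sketch of one pyramidal BLSTM layer of the Listener is given below: adjacent frames are concatenated before the BLSTM, so each layer halves the time resolution. The hidden size and the input feature dimension are illustrative assumptions, and the attention-based Speller is omitted for brevity.

```python
import torch
import torch.nn as nn

class PBLSTMLayer(nn.Module):
    """One pyramidal BLSTM layer: concatenate adjacent time steps, then a BLSTM,
    halving the sequence length while doubling the per-step feature size."""
    def __init__(self, input_dim: int, hidden: int = 256):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (B, T, D)
        b, t, d = x.shape
        x = x[:, : t - (t % 2)]                # drop the last frame if T is odd
        x = x.reshape(b, t // 2, d * 2)        # stack pairs of adjacent frames
        out, _ = self.blstm(x)                 # (B, T/2, 2 * hidden)
        return out

# Listener with two consecutive pBLSTM layers, as described above
# (the fusion feature is assumed to arrive as a (B, T, 64) sequence -- an assumption):
# listener = nn.Sequential(PBLSTMLayer(input_dim=64), PBLSTMLayer(input_dim=512))
```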
Referring to
This embodiment provides a computer-readable storage medium. The computer-readable storage medium stores a computer program.
When executed by a processor, the computer program implements the steps of the multimodal speech recognition method in Embodiment 1 or Embodiment 2.
Compared with the prior art, the present disclosure has the following effects:
Each embodiment of this specification is described in a progressive manner, each embodiment focuses on the difference from other embodiments, and the same and similar parts between the embodiments may refer to each other. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, the description is relatively simple, and reference can be made to the method description.
In this specification, several specific embodiments are used for illustration of the principles and implementations of the present disclosure. The description of the foregoing embodiments is used to help illustrate the method of the present disclosure and the core ideas thereof. In addition, persons of ordinary skill in the art can make various modifications in terms of specific implementations and the scope of application in accordance with the ideas of the present disclosure. In conclusion, the content of this specification shall not be construed as a limitation to the present disclosure.