The present application claims priority under 35 U.S.C. § 119 to Chinese Patent Application No. 202211505381.X, filed with the China National Intellectual Property Administration on Nov. 28, 2022, the disclosure of which is incorporated by reference herein in its entirety.
The present disclosure relates to the fields of speech processing and artificial intelligence, and in particular, to a method performed by an electronic device and an apparatus.
Currently, in the field of speech separation, deep learning-based speech separation algorithms have surpassed traditional signal processing, as their strong nonlinear modeling capabilities may achieve better results in the task. Among deep learning methods, recurrent neural networks are particularly suitable for describing input data with sequence relationships, such as natural language and time sequences, because of their inherently time-dependent nature. They are an important component of modern intelligent speech processing systems, and their recurrent connections are crucial for learning long sequence relationships of speech and correctly managing speech context. However, since the computation of each step of a recurrent neural network relies on the hidden layer states output in the previous step, the existing speech separation schemes cannot accurately separate the speech signal of each sound source when a sound source to be separated produces no signal within a certain period of time, and the separation accuracy needs to be further improved.
The exemplary embodiments of the present disclosure provide a method performed by an electronic device and a device that solve at least the above technical problem and other technical problems not mentioned above, and provide the following beneficial effects.
According to an aspect of the exemplary embodiments of the present application, there is provided a method performed by an electronic device. The method may include: obtaining an audio signal comprising a speech signal uttered by at least one sound source; determining a target audio segment of the audio signal, wherein the target audio segment is determined based on a speech quality of at least one audio segment, wherein the at least one audio segment is divided from the audio signal; and performing speech separation on the audio signal based on the target audio segment to obtain at least one separated speech signal corresponding to the at least one sound source.
According to an aspect of the exemplary embodiments of the present application, there is provided a method performed by an electronic device. The method may include: obtaining a training sample, wherein the training sample includes a speech signal uttered by at least one sound source in a noiseless environment and an audio signal composed of the speech signal and a noise signal; determining, by an audio segment search module included in a speech processing model, a target audio segment of the audio signal, wherein the target audio segment is determined based on speech quality of each audio segment divided from the audio signal; performing, by a separation module included in the speech processing model, speech separation on the audio signal according to the target audio segment, to obtain a separated speech signal corresponding to each sound source; and adjusting parameters of the speech processing model based on the obtained speech signal and the corresponding separated speech signal.
According to an aspect of the exemplary embodiments of the present application, there is provided an electronic device. The electronic device includes: at least one memory storing computer executable instructions; and at least one processor. The at least one processor, when executing the stored instructions, is configured to: obtain an audio signal comprising a speech signal uttered by at least one sound source; determine a target audio segment of the audio signal, wherein the target audio segment is determined based on a speech quality of at least one audio segment, wherein the at least one audio segment is divided from the audio signal; and perform speech separation on the audio signal based on the target audio segment to obtain at least one separated speech signal corresponding to the at least one sound source.
The computer executable instructions may include first obtaining code configured to cause the at least one processor to obtain an audio signal to be processed, wherein the audio signal comprises a speech signal uttered by at least one sound source; first determining code configured to cause the at least one processor to determine a target audio segment of the audio signal, wherein the target audio segment is determined based on a speech quality of each audio segment, wherein each audio segment is divided from the audio signal; and first performing code configured to cause the at least one processor to perform speech separation on the audio signal based on the target audio segment to obtain a separated speech signal corresponding to each sound source.
According to an aspect of the exemplary embodiments of the present application, there is provided a non-transitory computer readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to: obtain an audio signal comprising a speech signal uttered by at least one sound source; determine a target audio segment of the audio signal, wherein the target audio segment is determined based on a speech quality of at least one audio segment, wherein the at least one audio segment is divided from the audio signal; and perform speech separation on the audio signal based on the target audio segment to obtain at least one separated speech signal corresponding to the at least one sound source.
According to an aspect of the exemplary embodiments of the present application, there is provided a computer program product in which instructions are executed by at least one processor in an electronic device to perform the above method.
By using a modeling method of adaptively connecting target audio segments to separate each sound source signal from the audio signal, the present disclosure can not only solve the problem of long-term forgetting of the prediction network, but also significantly improve the accuracy of speech separation.
These and/or other aspects and advantages of the present disclosure will become clear and easier to understand from the following description of the embodiments in conjunction with the accompanying drawings, wherein:
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
The following description with reference to the attached drawings is provided to assist in a complete understanding of the embodiments of the present disclosure as defined by the claims and their equivalents. A variety of specific details are included to assist in understanding, but these details are considered only exemplary. Thus, those skilled in the art will be aware that the embodiments described herein may be subject to various changes and modifications without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and structures are omitted for clarity and brevity.
The terms and words used in the following description and claims are not limited to the written meaning and are used only by the inventor to achieve a clear and consistent understanding of the present disclosure. Accordingly, it should be clear to those skilled in the art that the following description of the various embodiments of the present disclosure is provided only for illustrative purposes and not to limit the purposes of the present disclosure defined by the claims and their equivalents.
It should be noted that the terms “first”, “second” and the like in the description and claims as well as the above drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the data used in this way may be interchanged under appropriate circumstances, so that the embodiments of the present disclosure described herein can be implemented in an order other than those illustrated or described herein. The embodiments described in the following exemplary embodiments are not representative of all embodiments consistent with the present disclosure. Rather, they are only examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
Existing speech separation schemes have the problem of long-term forgetting. When a speaker keeps quiet and does not speak for a long time (for example, 1-5 minutes), the speech separation networks will suffer from the problem of separation errors which are caused by the following two main reasons:
The existing speech separation networks need to learn features of speakers, such as a speaker's pronunciation habits, rhythm, intonation, etc., and use the learned features of the speakers to separate the speakers' voices from a mixed speech signal. Taking an intelligent meeting scenario as an example, when a speaker A finishes speaking and a speaker B starts to speak, if the speaker A keeps quiet for a long time (for example, 1-5 minutes), the speech separation network will gradually forget the speech features of the speaker A, which will result in incorrect separation results.
In addition, the optimal number of iterations of existing speech separation neural networks is between 200 and 300. If the number of iterations exceeds this interval, the performance of the neural network will be reduced. However, in the speech separation network, a time interval of a global path for connecting audio units of each audio segment is about 20 ms. If a signal of 1 to 5 minutes is processed, the number of iterations of the global path will be as high as 3000 to 15000, which will lead to a serious problem of long-term forgetting in the network.
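The iteration counts quoted above follow from dividing the signal duration by the 20 ms interval of the global path. A minimal sketch of that arithmetic is given below; the function name is illustrative only and does not appear in the disclosure.

```python
def global_path_steps(duration_s: float, interval_ms: float = 20.0) -> int:
    """Number of recurrent steps needed to span `duration_s`
    when the global path advances one step every `interval_ms`."""
    return round(duration_s * 1000.0 / interval_ms)


one_minute = global_path_steps(60)      # 3000 steps
five_minutes = global_path_steps(300)   # 15000 steps
```

Both values are far above the 200 to 300 iterations cited as the range in which such networks perform best.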
In order to improve the existing technology and enable the neural network to retain users' features even when the users do not speak for a long time, the present disclosure presents a new speech separation algorithm that adaptively connects high-quality audio segments of a speaker (also known as target audio segments). Here, the high-quality audio segments may be understood as audio segments that preserve the speaker's speech features with a high signal-to-noise ratio and low distortion.
First of all, the present disclosure designs a high-quality audio segment search module, which may find high-quality audio segments for each sound source (such as a speaker) by analyzing speech features of an input signal, so as to improve the signal-to-noise ratio of separated speech and reduce the distortion of separated speech.
Secondly, the present disclosure designs an adaptive path separation module, which uses a path of adaptively connecting high-quality audio segments for modeling, and extracts and fuses hidden layer state information of the high-quality audio segments to solve the problem of long-term forgetting.
The algorithm presented in the present disclosure may be applied not only to the separation of individual voices from an audio signal including multiple voices, but also to the separation of voices from background noise from an audio signal including both the voices and the background noise, for speech enhancement.
Hereinafter, according to various embodiments of the present disclosure, the methods and devices of the present disclosure are described in detail with reference to the attached drawings.
According to
The audio segment search module may find high-quality audio blocks by using speech distortion judgment without reference signals and long-term speech feature analysis. Each audio block may be composed of multiple audio segments, and one audio segment may be composed of multiple base units; for example, a time length of each audio block is about 8 s, and a time length of each audio segment is about 20 ms. Then, high-quality audio segments are found from the high-quality audio blocks through short-term speech feature analysis. This procedure is applied to each sound source (such as a speaker) to obtain high-quality audio segment indexes (such as an ID of an audio segment) of each sound source.
The self-adaptive path separation module may perform modeling for the features of the input audio signal (i.e., perform feature separation on the input audio signal) by using a local path and an adaptive path that adaptively connects high-quality audio segments. The local path (which may be understood as intra-frame modeling) performs modeling for features between two adjacent base units/audio units. The adaptive path (which may be understood as inter-frame modeling, that is, modeling for a current frame using speech features of a reference frame) performs modeling based on network hidden layer state information obtained by fusing the hidden layer state information of a high-quality audio segment with that of the audio segment preceding a current audio segment, to obtain a mask of each sound source. The problem of long-term forgetting may be solved by using the network hidden layer state information.
The decoding module may use the mask to decode the input audio signal and obtain an audio signal of each sound source, such as separated speeches of the speaker A and the speaker B. The disclosed speech processing model may separate speech signals of two or more persons from a mixed audio signal.
When separating a mixed audio in real time, it may be processed in the above way frame by frame. When a high-quality audio segment of a current audio segment is determined, several audio segments before the current audio segment may be taken as an audio block to find the corresponding high-quality audio segment in the audio block.
According to
With reference to
In step S602, a target audio segment in the audio signal is determined. The target audio segment may be determined based on speech quality of individual audio segments divided from the audio signal. In the present disclosure, the target audio segment may also be referred to as a high-quality audio segment, and one audio segment may be interpreted as a frame of an audio signal.
The target audio segment is an audio segment with high speech quality, that is, an audio segment for which at least one of speech distortion, signal-to-noise ratio, zero-crossing rate, and pitch quantity meets a predefined condition. In the present disclosure, the separation module may be realized based on hidden layer state information obtained through adaptive connection of high-quality audio segments. If the separation effect is good, it indicates that the hidden layer state information may well express features of each sound source, which means that the separation effect may be used to evaluate whether the hidden layer state information may well express the features of the speech signal. Therefore, high-quality audio segments in the original audio signal may be determined for each audio block.
Considering the characteristics of long-term relative stability and short-term instability of audio signals, both need to be taken into account. Therefore, the evaluation of speech quality in the present disclosure is performed on two scales of long-term and short-term. The audio signal may be divided into audio blocks (e.g., one audio block is 8 s) according to a first time period and each audio block may be divided into audio segments (e.g., one audio segment is 20 ms) according to a second time period.
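The two-scale division described above can be sketched as follows. This is only an illustration under stated assumptions: the signal is a plain list of samples at 16 kHz, the block and segment lengths use the example values from the text (8 s and 20 ms), and trailing samples that do not fill a whole segment are dropped. The function name is hypothetical.

```python
def split_into_blocks(signal, sample_rate=16000, block_s=8.0, segment_ms=20.0):
    """Split a 1-D sequence of samples into audio blocks, each a list of segments.

    Assumption: an incomplete trailing segment is discarded.
    """
    block_len = int(sample_rate * block_s)             # samples per block
    seg_len = int(sample_rate * segment_ms / 1000.0)   # samples per segment
    blocks = []
    for b_start in range(0, len(signal), block_len):
        block = signal[b_start:b_start + block_len]
        segments = [block[i:i + seg_len]
                    for i in range(0, len(block) - seg_len + 1, seg_len)]
        if segments:
            blocks.append(segments)
    return blocks


# 16 s of silence at 16 kHz: 2 blocks, 400 segments per block, 320 samples per segment
blocks = split_into_blocks([0.0] * (16 * 16000))
```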
As an example, for each audio block, a high-quality audio segment for a current audio block may be determined based on the audio quality of the individual audio segments in that audio block. The high-quality audio segment for the current audio block may be used for feature separation of the current audio block or a next audio block.
As another example, for each sound source that has been separated, a high-quality audio segment for each sound source in the original audio signal is determined for each audio block. The high-quality audio segments for the current audio block may be used for feature separation of a next audio block. In other words, the high-quality audio segment determined at the present moment may be used for feature separation at the next moment.
With reference to
Traditional methods to calculate speech distortion need to use an original pure speech signal as a reference signal. By calculating correlation between a separated speech signal after audio processing and the original reference signal, the distortion of the processed speech signal may be obtained, as shown in
However, in speech separation, because the input signal is a mixed audio signal, an ideal reference signal may not be obtained, so the present disclosure presents a speech distortion calculation method without the reference signal. For each audio block, the speech distortion may be determined by calculating correlation between the separated speech signal for the current audio block and a reference audio signal (that is, an audio signal obtained by subtracting the separated speech signal from the original audio signal for the corresponding time period).
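A minimal sketch of this reference-free measure is shown below. It assumes that the correlation is a Pearson correlation over time-domain samples and that a larger absolute correlation indicates more distortion (more of the source left in the residual). Neither choice is specified in the disclosure, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def reference_free_distortion(mixture, separated):
    """Correlate the separated signal with the residual (mixture minus separated).

    Assumption: |correlation| near 0 means the separated signal and the
    residual share little content, i.e. low distortion.
    """
    residual = [m - s for m, s in zip(mixture, separated)]
    return abs(pearson(separated, residual))


source = [1.0, -1.0, 1.0, -1.0]
other = [1.0, 1.0, -1.0, -1.0]              # uncorrelated with `source`
mixture = [s + o for s, o in zip(source, other)]
clean = reference_free_distortion(mixture, source)                    # 0.0
leaky = reference_free_distortion(mixture, [0.5 * s for s in source])  # > 0
```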
With reference to
In addition, the speech distortion may be estimated using a timing processing network such as a long short-term memory (LSTM) network, which may be used to compare the correlation between the speech signal Y and the audio signal S.
In the present disclosure, the speech signal-to-noise ratio may be determined by calculating a ratio between a separated speech signal for the current audio block and an original audio signal of the corresponding time period. For example, for each audio block, the ratio of the separated audio to the original audio may be calculated to determine if there is any redundant component in the current audio block.
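As a hedged illustration of this ratio, the sketch below compares the energy of the separated signal with the energy of the original audio over the same time period. Using signal energy (sum of squared samples) is an assumption; the disclosure only states that a ratio between the two signals is calculated. The function name and the `eps` guard are hypothetical.

```python
def separated_to_original_ratio(separated, original, eps=1e-12):
    """Energy of the separated signal divided by the energy of the original audio.

    `eps` avoids division by zero for an all-zero original (an added safeguard).
    """
    e_sep = sum(s * s for s in separated)
    e_orig = sum(o * o for o in original)
    return e_sep / (e_orig + eps)


ratio = separated_to_original_ratio([1.0, 1.0], [2.0, 2.0])  # about 0.25
```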
In addition, considering that the audio block with good separation effect may not contain the features of the sound source (such as the speaker remaining silent during this time period), the separation effect may be measured by incorporating methods such as pitch detection. For example, for each audio block, whether the current audio block contains a vowel may be determined by analyzing the zero crossing rate of the current audio block and whether the current audio block contains a pitch (that is, the number of pitches). Since vowels are an important part of speech signals, the speech components in the current audio block may be analyzed based on this.
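The zero-crossing rate mentioned above can be computed as in the following sketch. Pitch detection is not shown, and treating zero as non-negative when counting sign changes is a convention chosen here, not one taken from the disclosure.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Convention (assumption): a sample equal to zero is treated as non-negative.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


noisy_like = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])   # 1.0
steady = zero_crossing_rate([1.0, 1.0, 1.0, 1.0])         # 0.0
```

Voiced sounds such as vowels typically show a lower zero-crossing rate than unvoiced sounds or broadband noise, which is why this value may be combined with a pitch check.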
For example, when the speech distortion and the signal-to-noise ratio of the current audio block meet the preset conditions and contain vowels, the current audio block may be determined as a high-quality audio block, and then short-term signal analysis may be carried out.
If the current audio block is a high-quality audio block, it may be determined whether the speech quality of a current audio segment in the current audio block is higher than that of a previous audio segment in the current audio block. If the speech quality of the current audio segment is higher than that of the previous audio segment, it may then be determined whether the speech quality of the current audio segment is higher than that of a target audio segment determined for a previous audio block. Based on the comparison result, the target audio segment corresponding to each sound source is determined for the current audio block.
For each audio segment in a high-quality audio block, the speech distortion of the current audio segment may be determined by calculating the correlation between the separated speech signal for the current audio segment and an audio signal obtained by subtracting the separated speech signal from an original audio signal for the corresponding time period. The SNR of the current audio segment may be determined by calculating the ratio between the separated speech signal for the current audio segment and the original audio signal for the corresponding time period. Whether there is a vowel in the current audio segment may be determined by analyzing the zero crossing rate of the current audio segment and whether there is a pitch. Based on the above calculation, whether the speech quality of the current audio segment is higher than that of the previous audio segment is determined. The calculation of short-term speech distortion, short-term signal-to-noise ratio and short-term pitch analysis is similar to the above long-term calculation method. Here, the audio signal of an audio segment is selected for calculation.
For example, if the speech distortion and the signal-to-noise ratio of the current audio segment are better than those of the previous audio segment, the speech quality of the current audio segment is higher than that of the previous audio segment. Next, the current audio segment is compared with the high-quality audio segment determined for the previous audio block to find the high-quality audio segment suitable for the current audio segment. Here, the high-quality audio segment determined for the previous audio block refers to a high-quality audio segment determined when searching for the high-quality audio segment for the previous audio block, which may or may not be the audio segment from the previous audio block.
In a case where the speech quality of the current audio segment is higher than that of the target audio segment determined for the previous audio block, the current audio segment may be determined as the target audio segment, that is, the high-quality audio segment is used for the next audio block.
In a case where the speech quality of the current audio segment is lower than that of the target audio segment determined for the previous audio block, if a difference between the speech quality of the current audio segment and that of the target audio segment determined for the previous audio block is less than a preset threshold and a time interval between the current audio segment and the target audio segment determined for the previous audio block is greater than a time threshold, the current audio segment may be determined as the target audio segment, that is, the high-quality audio segment is used for the next audio block.
If any one of the above conditions is not met, the previously determined high-quality audio segment remains unchanged, that is, when the current audio block is separated, the high-quality audio segment determined for the previous audio block is used.
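The update rule in the three preceding paragraphs can be summarized by the sketch below. It assumes that speech quality has already been reduced to a single comparable score and that time is measured in seconds; the disclosure does not fix either representation, and all names are hypothetical.

```python
def should_update_target(cur_quality, cur_time, tgt_quality, tgt_time,
                         quality_threshold, time_threshold):
    """Return True if the current segment should replace the stored target segment.

    Assumptions: quality is a scalar score (higher is better); times are in seconds.
    """
    # Case 1: the current segment is strictly better than the stored target.
    if cur_quality > tgt_quality:
        return True
    # Case 2: slightly worse, but the stored target is old enough to be refreshed.
    close_enough = (tgt_quality - cur_quality) < quality_threshold
    old_enough = (cur_time - tgt_time) > time_threshold
    return close_enough and old_enough


better = should_update_target(0.90, 10.0, 0.80, 0.0, 0.05, 60.0)          # True
fresher = should_update_target(0.78, 100.0, 0.80, 0.0, 0.05, 60.0)        # True
too_recent = should_update_target(0.78, 30.0, 0.80, 0.0, 0.05, 60.0)      # False
too_poor = should_update_target(0.50, 100.0, 0.80, 0.0, 0.05, 60.0)       # False
```

When the function returns False, the previously determined high-quality audio segment is kept.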
According to another example of the present disclosure, if the current audio block is a target audio block (that is, a high-quality audio block), the speech quality of each audio segment in the current audio block may be determined and a first audio segment with the highest speech quality may be selected from the audio segments. Then it is determined whether the speech quality of the first audio segment is higher than that of the target audio segment determined for the previous audio block of the current audio block, and the target audio segment corresponding to each sound source for the current audio block is determined based on the comparison result. For example, if the speech quality of the first audio segment is higher than that of the target audio segment determined for the previous audio block, the first audio segment may be determined as the target audio segment. If the speech quality of the first audio segment is lower than that of the target audio segment determined for the previous audio block, but the difference between the two is less than the preset threshold and the time interval between the first audio segment and the target audio segment determined for the previous audio block is greater than the time threshold, the first audio segment is determined as the target audio segment.
If the current audio block does not belong to the high-quality audio block, the search for the high-quality audio segment may be ended, that is, the previously determined high-quality audio segment is continually used.
After the high-quality audio segment is determined for the current audio block, the high-quality audio segment index for each sound source may be output. In this way, the separation module of the present disclosure may use the high-quality audio segment index determined for the current audio block to find the corresponding audio segment when the next audio block is separated.
In step S603, speech separation is performed on the audio signal based on the target audio segment to obtain a separated speech signal corresponding to each sound source.
In order to model the audio signal and use the neural network to learn the inherent connections within the audio signal, feature extraction may first be carried out on the audio signal to obtain high-dimensional speech feature information. The audio signal may be encoded to obtain encoding features of the audio signal. For example, the audio signal may be transformed by discrete Fourier transform and encoded into a high-dimensional feature vector, which may be used to represent speech information of different dimensions. Here, the encoding may be performed by the encoding module of the speech processing model of the present disclosure.
As an example, the encoding module may be used to extract the features of the audio signal and obtain a feature vector of another dimension, which helps the neural network model and learn the audio signal. Short-Time Fourier Transform (STFT) may be used for feature extraction. For example, the encoding module may use STFT to perform frame splitting and Fourier transform on an audio signal s1, to obtain speech features in the frequency domain. For an audio signal with a sampling rate of 16K Hz and a duration of n seconds, the number of sampling points is L=n*16000. When an STFT of s_n points is performed, that is, when the number of sampling points per frame is s_n and the overlap between adjacent frames is s_n/2 (i.e., a 50% overlap rate), the number of frames is M=L/(s_n/2)-1 and the number of frequency points per frame is f=s_n/2. The real and imaginary parts of the frequency domain are taken out respectively, so that the dimension of the output feature vector is [M, f]. In the following description, an audio signal with a sampling rate of 16K Hz and a duration of 8 s is used as an example.
When STFT of 512 sampling points per frame is performed on a speech signal with a duration of 8 s and a sampling rate of 16K Hz, the number of the frequency points per frame may be obtained, i.e., s_n/2=512/2=256, and a feature vector with a dimension of [499,256] may be obtained, that is, there are 499 frames, each frame has 256 frequency points. Each frequency point is represented by a real part and an imaginary part.
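The feature dimension above can be checked with a short calculation that applies the stated relations M=L/(s_n/2)-1 and f=s_n/2. The function name is illustrative only.

```python
def stft_feature_shape(duration_s, sample_rate=16000, frame_len=512):
    """Return (number of frames M, frequency points per frame f)
    for an STFT with 50% overlap, following M = L/(s_n/2) - 1 and f = s_n/2."""
    hop = frame_len // 2                          # s_n/2 samples between frame starts
    num_samples = int(duration_s * sample_rate)   # L
    num_frames = num_samples // hop - 1           # M
    num_bins = frame_len // 2                     # f
    return num_frames, num_bins


shape_8s = stft_feature_shape(8)   # (499, 256)
```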
In addition, other feature extraction methods (such as a convolutional neural network (CNN)) may also be used for feature extraction, and the present disclosure is not limited thereto.
After feature extraction in the previous step, the encoding module may directly use an encoder to encode and obtain a feature vector of a higher dimension. It may also divide the extracted features to obtain sub-features, and then encode the sub-features by using respective encoders, so as to reduce the complexity of the neural network and improve the processing speed of the neural network.
As an example, a subband-based encoding method may be adopted when the extracted features are divided according to frequency bands. For example, a frequency band of 16K Hz is divided into N subbands, and N subband features are obtained accordingly. N sub-encoders are used for encoding respectively.
The more subbands the features are divided into, the finer the feature processing will be, but more sub-encoders will be introduced, which increases the network complexity. Considering the performance and network complexity, the extracted features may be divided into 4 to 6 subband features, and 4 to 6 sub-encoders are used accordingly.
For example, by adopting a division mode of 4 subbands, according to the frequency domain data obtained by feature extraction, the data fk with 256 frequency points in each frame is divided into four subband features f1k, f2k, f3k and f4k. The frequency points contained in each subband feature are {1˜32}, {33˜64}, {65˜128}, and {129˜256}, and the corresponding frequencies are 0˜2K, 2K˜4K, 4K˜8K, and 8K˜16K, where k = 0, 1, 2, ..., 498 represents the frame number.
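A minimal sketch of this four-subband split for a single frame is given below. It assumes that each frequency point is stored as a (real, imaginary) pair and that the two parts are interleaved when flattened, which yields subband widths of 64, 64, 128 and 256 values. The storage layout and names are assumptions.

```python
# 1-based, inclusive frequency-point ranges taken from the text
SUBBAND_EDGES = [(1, 32), (33, 64), (65, 128), (129, 256)]


def split_subbands(frame_bins):
    """Split one frame of 256 (real, imag) pairs into four flat subband features.

    Assumption: real and imaginary parts are interleaved in the output.
    """
    subbands = []
    for lo, hi in SUBBAND_EDGES:
        flat = []
        for re, im in frame_bins[lo - 1:hi]:
            flat.extend((re, im))
        subbands.append(flat)
    return subbands


frame = [(float(i), 0.0) for i in range(1, 257)]   # dummy frame, point i has real part i
widths = [len(sb) for sb in split_subbands(frame)]  # [64, 64, 128, 256]
```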
If the extracted features are not divided into subband features, the encoding module may use one encoder to encode the full-band features to obtain a higher-dimensional feature vector.
If subband processing is adopted, the full-band features are divided into multiple subband features, and each subband feature is encoded by a different sub-encoder, thereby achieving parallel encoding and reducing complexity. Assuming that the number of subbands is N, there are N sub-encoders, one for each subband feature.
With reference to
In the encoding process, the first sub-encoder may extend the dimension [499,64] of the subband feature f1k (here, it represents the subband feature with frequency points {1˜32} of 499 frames from 0 to 498) to [1,1,499,64], perform 2-dimensional convolution operation, in which the output channel of the convolutional network is 256, the convolution kernel is 5×5, and the step size is 1×1, and output the subband encoded feature vector x1k [1,256,499,64]. The second sub-encoder may extend the dimension [499,64] of the subband feature f2k (here, it represents the subband feature with frequency points {33˜64} of 499 frames from 0 to 498) to [1,1,499,64], perform 2-dimensional convolution operation, in which the output channel of the convolutional network is 256, the convolution kernel is 5×5, and the step size is 1×1, and output the subband encoded feature vector x2k [1,256,499,64]. The third sub-encoder may extend the dimension [499,128] of the subband feature f3k (here, it represents the subband feature with frequency points {65˜128} of 499 frames from 0 to 498) to [1,1,499,128], perform 2-dimensional convolution operation, in which the output channel of the convolutional network is 256, the convolution kernel is 5×6, and the step size is 1×2, and output the subband encoded feature vector x3k [1,256,499,64]. The fourth sub-encoder extends the dimension [499,256] of the subband feature f4k (here, it represents the subband feature with frequency points {129˜256} of 499 frames from 0 to 498) to [1,1,499,256], performs 2-dimensional convolution operation, in which the output channel of the convolutional network is 256, the convolution kernel is 5×6, and the step size is 1×4, and outputs the subband encoded feature vector x4k [1,256,499,64]. The above examples are illustrative only and the present disclosure is not limited thereto.
After being processed by the encoders, the dimension of each subband feature vector xik is [1,256,499,64], where i represents the ith band and k represents the kth frame.
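The common output dimension can be verified with the standard convolution output-size relation, out = floor((in + 2*pad - kernel)/stride) + 1. The kernel sizes and step sizes below come from the example above; the padding values are assumptions, since the disclosure does not state them, and were chosen so that each sub-encoder keeps 499 frames and produces a width of 64.

```python
def conv_out(size, kernel, stride, pad):
    """Output length of a convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1


# name: (input width, kernel width, stride width, assumed pad width)
SUB_ENCODERS = {
    "x1k": (64, 5, 1, 2),
    "x2k": (64, 5, 1, 2),
    "x3k": (128, 6, 2, 2),
    "x4k": (256, 6, 4, 1),
}

# Time axis: kernel height 5, step 1, assumed padding 2 -> 499 frames preserved
frames_out = conv_out(499, 5, 1, 2)
widths_out = {name: conv_out(w, k, s, p) for name, (w, k, s, p) in SUB_ENCODERS.items()}
```

With 256 output channels, each sub-encoder therefore yields a feature vector of dimension [1,256,499,64] under these padding assumptions.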
Next, feature separation may be performed on the encoded features of the audio signal based on the determined target audio segment to obtain a feature mask corresponding to each sound source.
A modeling path may adopt a fixed path modeling method combining a local path and a global path, as shown in
Based on this, the present disclosure introduces a fusion procedure of hidden layer state information of the network (the neural network for feature separation) to solve the problem of gradually losing the speech feature information of the speaker when the speaker does not speak for a long time. The present disclosure fuses the hidden layer state information of the high-quality audio segment of each sound source searched by the audio segment search module into the current hidden layer state, so that the network may retain the speech features of each sound source, so as to solve the problem of long-term network forgetting. At the same time, the hidden layer state information of the previous audio segment may be fused to make the network better track the context information and short-term feature information, so as to ensure the continuity of the hidden layer state.
When performing feature separation on the current audio segment, local path modeling may be performed on the current audio segment first, in which feature separation may be performed unit by unit on the audio units of the current audio segment. Then, adaptive path modeling may be performed on the current audio segment, in which the hidden layer state information of the searched high-quality audio segment is fused with the hidden layer state information of the previous audio segment, and feature separation is performed on the current audio segment by using the fused hidden layer state information. In the adaptive path modeling, when feature separation is performed on an audio unit of the current audio segment, the hidden layer state information of the collocated audio units of the high-quality audio segment and the previous audio segment may be used. Here, the audio unit may refer to a feature unit in an audio segment obtained by splitting and rearranging the original encoded features. The audio unit in the local path modeling may be different from that in the adaptive path modeling.
The hidden layer state information may be fused by weighting. Taking an audio signal including two speakers as an example, the following equation may be used for the fusion of the hidden layer state information.
$$h_{\mathrm{fusion}} = \alpha_q \,(h_A + h_B) + \gamma_{s-1}\, h_{s-1}$$
where $h_A$ and $h_B$ respectively represent the hidden layer state information of the speakers A and B, $h_{s-1}$ represents the hidden layer state information of the previous audio segment, $\alpha_q$ represents the weight of the hidden layer state information of the speakers A and B, and $\gamma_{s-1}$ represents the weight of the hidden layer state information of the previous audio segment. Here, since the speakers A and B are equally important, the same weight $\alpha_q$ is used; different weights may instead be set according to the importance of the speakers. In addition, the weight $\gamma_{s-1}$ of the hidden layer state information of the previous audio segment is related to the speech quality of the high-quality audio segment. If the high-quality audio segment is updated for the current audio block, the hidden layer state information of the previous audio segment may use a smaller weight, and if the high-quality audio segment is not updated for the current audio block, the hidden layer state information of the previous audio segment may use a larger weight, so that the network may obtain more contextual information and short-term features. The weights of each hidden layer state information may be set differently.
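The weighted fusion described above may be sketched as follows. This is an illustration only: the hidden layer state information is represented as plain lists, and the function names and the weight values 0.2, 0.6 and 0.5 are hypothetical, not values given in the present disclosure.

```python
def fuse_hidden_states(h_a, h_b, h_prev, alpha_q, gamma_prev):
    """Element-wise h_fusion = alpha_q * (h_A + h_B) + gamma_{s-1} * h_{s-1}."""
    return [alpha_q * (a + b) + gamma_prev * p
            for a, b, p in zip(h_a, h_b, h_prev)]

def previous_segment_weight(target_updated, small=0.2, large=0.6):
    """Smaller weight if the high-quality segment was updated for the current
    audio block, larger weight otherwise (both values are illustrative)."""
    return small if target_updated else large

h_a = [1.0, 0.0]      # hidden state of speaker A's high-quality segment
h_b = [0.0, 2.0]      # hidden state of speaker B's high-quality segment
h_prev = [0.5, 0.5]   # hidden state of the previous audio segment

gamma = previous_segment_weight(target_updated=True)
h_fusion = fuse_hidden_states(h_a, h_b, h_prev, alpha_q=0.5, gamma_prev=gamma)
print(h_fusion)
```

The same weight is applied to both speakers here; separate weights per speaker would only require replacing `alpha_q` with one argument per speaker.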
In addition, timing processing networks such as LSTM may be used to fuse the hidden layer state information. The timing processing network may learn how to fuse the hidden layer state information to achieve the best separation effect, as shown in
In the local path modeling, the input feature vector s_input may be split and rearranged to obtain transverse local features (representing all features on a frame). The corresponding vector splitting mode may be defined as a local splitting mode. The operation of the local splitting mode is as follows: the feature vector s_input after dimension reduction is split by frame, and then rearranged into a 3D feature vector. As shown in
Feature separation is performed on the feature vector v_local by the first LSTM, and a first feature vector is output through a Normalization layer. In the present disclosure, the first feature vector may be understood as a feature vector obtained by the local path modeling. In the local path modeling, the hidden layer states of the LSTM for modeling the feature vector v_local are iterated inside the LSTM to obtain the latest context information and short-term features.
To better model the first feature vector obtained from the local path modeling, the first feature vector may be split and rearranged again in the adaptive path modeling to obtain longitudinal global features (representing features of a certain frequency point on all frames). The corresponding vector splitting mode may be defined as a global splitting mode. The operation of the global splitting mode is as follows: the first feature vector is split by the unit of frame, and then rearranged into another 3D feature vector. As shown in
Feature separation is performed on the feature vector v_global by the second LSTM, and a separated feature vector s_output is output through a Normalization layer.
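The two splitting modes may be pictured with a small rearrangement sketch. The axis ordering is an assumption: the input is taken as a nested list indexed [channel][frame][frequency] without a batch axis, the local mode is read as one sequence per frame running over frequency, and the global mode as one sequence per frequency point running over frames. For simplicity both functions are applied to the same input here, whereas the text applies the global mode to the output of the local path. The function names are hypothetical.

```python
def local_split(features):
    """[C][T][F] -> [T][F][C]: one sequence per frame, running over frequency."""
    channels, frames, freqs = len(features), len(features[0]), len(features[0][0])
    return [[[features[c][t][f] for c in range(channels)]
             for f in range(freqs)]
            for t in range(frames)]

def global_split(features):
    """[C][T][F] -> [F][T][C]: one sequence per frequency point, running over frames."""
    channels, frames, freqs = len(features), len(features[0]), len(features[0][0])
    return [[[features[c][t][f] for c in range(channels)]
             for t in range(frames)]
            for f in range(freqs)]

# Toy tensor whose values encode their own position: 100*c + 10*t + f.
C, T, F = 2, 3, 4
features = [[[100 * c + 10 * t + f for f in range(F)]
             for t in range(T)]
            for c in range(C)]

v_local = local_split(features)    # T sequences of length F
v_global = global_split(features)  # F sequences of length T
print(len(v_local), len(v_local[0]), len(v_global), len(v_global[0]))
```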
When modeling for the feature vector v_global, the hidden layer states of the second LSTM may be initialized with the obtained fused hidden layer state information $h_{\mathrm{fusion}}$. By processing the data of each frame, the hidden layer states of the second LSTM may be constantly updated. The hidden layer states contain the high-quality speech features of each sound source.
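The effect of initializing the recurrent state can be illustrated with a toy recurrence. Note that the scalar tanh recurrence below is a deliberately simplified stand-in for the second LSTM, not an LSTM, and that the weights `w_x`, `w_h` and the initial value 0.9 (playing the role of the fused hidden layer state information) are hypothetical.

```python
import math

def run_recurrence(inputs, h_init, w_x=0.5, w_h=0.8):
    """Simple tanh recurrence; returns the hidden state after every frame."""
    h = h_init
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)  # state is updated frame by frame
        states.append(h)
    return states

silent_frames = [0.0, 0.0, 0.0]  # a stretch in which the source does not speak

cold = run_recurrence(silent_frames, h_init=0.0)  # no prior information
warm = run_recurrence(silent_frames, h_init=0.9)  # initialized with a fused state
print(cold, warm)
```

With a zero initial state the recurrence carries nothing through the silent stretch, while the initialized run starts from non-zero state information that then decays frame by frame, which is why the state is refreshed by fusion rather than relying on the recurrence alone.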
In addition to the LSTM network, the present disclosure may also use CNN, Transformer and other networks for feature separation.
Next, the output feature vector s_output passes through a two-dimensional convolution layer (Conv2d) and then through two branches, namely a one-dimensional convolution layer with a Tanh activation layer (Conv1d+Tanh) and a one-dimensional convolution layer with a sigmoid activation layer (Conv1d+σ), and the outputs of the two branches are multiplied to obtain a feature vector [m,64,499,64]. Finally, a one-dimensional convolution layer and an activation function (Conv1d+ReLU) are used to perform dimension recovery, and a mask [m,256,499,64] for each sound source is output, where m represents the number of sound sources to be separated, 256 represents the feature dimension of each frequency component, 499 represents the number of frames, and 64 represents the number of frequency components. For example, m=2 indicates the speaker A and the speaker B.
According to another embodiment of the present disclosure, for each audio segment of an audio signal, feature separation may be performed on encoded features corresponding to a current audio segment based on a target audio segment determined for a previous audio block of an audio block where the current audio segment is located and a previous audio segment of the current audio segment, to obtain a feature mask corresponding to each sound source. Because a neural network is used for feature separation, the hidden layer state information of the network may also express the features of the speech signal. Therefore, the hidden layer state information of the target audio segment and the previous audio segment may be acquired. Here, the hidden layer state information is obtained during feature separation of the target audio segment and the previous audio segment and includes at least one of short-term speech features, long-term speech features and context features of each sound source. Then, the hidden layer state information of the target audio segment and the previous audio segment is fused to obtain fused hidden layer state information, and the encoded features corresponding to the current audio segment are separated based on the fused hidden layer state information. Local path modeling and adaptive path modeling may be used to separate the features of each audio segment.
For example, the encoded features corresponding to the current audio segment may include multiple audio units. Intra-frame processing may be performed first, that is, feature separation is performed unit by unit to obtain first separated features corresponding to the current audio segment. Inter-frame processing is then performed, that is, feature separation is performed on the first separated features based on the fused hidden layer state information to obtain the feature mask for each sound source of the current audio segment.
The above method for feature separation of the audio signal may be applied to the case without frequency band division and also to the case of subband processing.
The audio signal may be decoded based on the feature mask to obtain a separated speech signal corresponding to each sound source. For example, the mask obtained by the separation module and the feature vector of the audio signal output by the encoding module may be multiplied element-wise, and feature decoding may then be carried out to recover the separated time domain signal of each sound source.
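The masking step can be sketched as below, assuming the multiplication is element-wise and using short flattened lists in place of the full encoded feature tensor. Only the masking is shown; the subsequent feature decoding is not. The name `apply_masks` and all values are hypothetical.

```python
def apply_masks(encoded, masks):
    """Element-wise product of each per-source mask with the encoded features."""
    return [[m * x for m, x in zip(mask, encoded)] for mask in masks]

encoded = [0.8, 0.2, 0.5, 1.0]    # encoded features of the mixture (flattened)
masks = [
    [1.0, 0.0, 0.5, 0.25],        # mask for speaker A
    [0.0, 1.0, 0.5, 0.75],        # mask for speaker B
]

masked = apply_masks(encoded, masks)
for source, features in zip("AB", masked):
    print(source, features)
```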
If the extracted features are not divided into subband features, the decoding module may use one decoder to decode the full-band features to obtain the separated speech signals of each sound source.
If subband processing is adopted, the decoding module may use different sub-decoders for decoding, thereby achieving parallel decoding and reducing complexity. Assuming that there are N sub-encoders corresponding to the individual subband features, N sub-decoders are used to decode the respective subband features.
The sub-decoder may be implemented by a linear fully connected layer, or it may use other networks (such as a CNN) for feature conversion to calculate the predicted features of the target sound source.
The speech separation technology of the present disclosure may be applied to intelligent meeting minutes, audio and video editing, voice calls and other everyday scenarios.
For example, when multiple people are in a meeting, their speeches may be separated in real time and subsequent processing (such as real-time transcription, recognition, translation, etc.) is performed thereon. As shown in
In addition, the speech separation technology proposed in the present disclosure may be applied to video/audio editors in smart phones to edit sounds required by users in video/audio. For example, the speech separation technology of the present disclosure may separate each speaker's speech in video/audio (such as a speaker A, a speaker B, background noise, etc.). As shown in
The speech processing model of the present disclosure may include an encoding module, an audio segment search module, an adaptive path separation module and a decoding module. The separation module may include a first separation module (also known as a local path modeling module), a hidden layer state information fusion module and a second separation module (also known as an adaptive path modeling module), and each module may be realized by a neural network.
In step S2202, the audio signal is encoded by the encoding module to obtain encoded features of the audio signal.
In step S2203, the audio segment search module determines a target audio segment in the audio signal, where the target audio segment may be determined based on speech quality of each audio segment divided from the audio signal. The speech quality may include at least one of speech distortion, signal-to-noise ratio, zero crossing rate, and pitch quantity.
As an example, the audio signal may be divided into multiple audio blocks according to a first time period and each audio block may be divided into multiple audio segments according to a second time period. The target audio segment of the audio signal may be determined for each audio block.
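The two-level division may be sketched as follows. The sample rate of 16 kHz, the first time period of 4 seconds and the second time period of 1 second are illustrative assumptions, and the function name is hypothetical.

```python
def split_blocks_segments(num_samples, sample_rate, block_seconds, segment_seconds):
    """Return a list of blocks; each block is a list of (start, end) sample ranges."""
    block_len = int(block_seconds * sample_rate)
    seg_len = int(segment_seconds * sample_rate)
    blocks = []
    for b_start in range(0, num_samples, block_len):
        b_end = min(b_start + block_len, num_samples)  # last block may be shorter
        segments = [(s, min(s + seg_len, b_end))
                    for s in range(b_start, b_end, seg_len)]
        blocks.append(segments)
    return blocks

# 10 seconds of audio at an assumed 16 kHz sample rate.
blocks = split_blocks_segments(num_samples=160000, sample_rate=16000,
                               block_seconds=4.0, segment_seconds=1.0)
print([len(b) for b in blocks])
```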
For example, for each audio block, a high-quality audio segment for a current audio block may be determined based on the audio quality of individual audio segments in the audio block. A high-quality audio segment for the current audio block may be used for feature separation of the current audio block or feature separation of the next audio block.
For another example, for each sound source that has been separated, whether a current audio block belongs to a target audio block is determined based on the speech quality of the current audio block. Whether the speech quality of a current audio segment in the current audio block is higher than that of a previous audio segment of the current audio segment is determined in a case where the current audio block belongs to the target audio block. Whether the speech quality of the current audio segment is higher than that of a target audio segment determined for a previous audio block of the current audio block is determined in a case where the speech quality of the current audio segment is higher than that of the previous audio segment. A target audio segment corresponding to each sound source for the current audio block is determined based on the comparison result. For example, the current audio segment is determined as the target audio segment in a case where the speech quality of the current audio segment is higher than that of the target audio segment determined for the previous audio block. The current audio segment is determined as the target audio segment if a difference between the speech quality of the current audio segment and that of the target audio segment determined for the previous audio block is less than a preset threshold and a time interval between the current audio segment and the target audio segment determined for the previous audio block is greater than a time threshold, in a case where the speech quality of the current audio segment is lower than that of the target audio segment determined for the previous audio block.
As another example, the speech quality of each audio segment in the current audio block may be determined in a case where the current audio block belongs to the target audio block, and a first audio segment with the highest speech quality may be selected. Whether the speech quality of the first audio segment is higher than that of a target audio segment determined for a previous audio block of the current audio block is then determined. A target audio segment corresponding to each sound source for the current audio block is determined based on the comparison result. For example, the first audio segment may be determined as the target audio segment in a case where the speech quality of the first audio segment is higher than that of the target audio segment determined for the previous audio block. The first audio segment may be determined as the target audio segment if a difference between the speech quality of the first audio segment and that of the target audio segment determined for the previous audio block is less than a preset threshold and a time interval between the first audio segment and the target audio segment determined for the previous audio block is greater than a time threshold, in a case where the speech quality of the first audio segment is lower than that of the target audio segment determined for the previous audio block.
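The selection rule of this second example may be sketched as below for a single sound source. The sketch omits the check of whether the current audio block belongs to the target audio block, treats the first block (no previous target) as always accepting its best segment, and measures time in segment indices; these simplifications, the names `Segment` and `update_target`, and the threshold values are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    index: int       # position of the segment in time (segment count)
    quality: float   # speech quality score, higher is better

def update_target(block, prev_target, quality_threshold, time_threshold):
    """Choose the target audio segment for the current audio block."""
    best = max(block, key=lambda seg: seg.quality)  # first audio segment
    if prev_target is None or best.quality > prev_target.quality:
        return best
    # Lower quality: still replace the old target if the gap is small
    # and the old target is far enough in the past.
    close_enough = prev_target.quality - best.quality < quality_threshold
    stale = best.index - prev_target.index > time_threshold
    if close_enough and stale:
        return best
    return prev_target

prev = Segment(index=2, quality=0.9)
block = [Segment(10, 0.5), Segment(11, 0.85), Segment(12, 0.7)]

print(update_target(block, prev, quality_threshold=0.1, time_threshold=5))
print(update_target(block, prev, quality_threshold=0.1, time_threshold=20))
```

In the first call the previous target is replaced because the quality gap is within the threshold and the previous target is old; in the second call the time threshold is not exceeded, so the previous target is kept.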
The speech quality may include at least one of speech distortion, signal-to-noise ratio, zero crossing rate and pitch quantity, where the speech distortion is determined by calculating correlation between a separated speech signal for an audio segment and a reference audio signal, wherein the reference audio signal is an audio signal obtained by subtracting the separated speech signal from an original audio signal corresponding to the audio segment. The signal-to-noise ratio is determined by calculating a ratio between a separated speech signal for an audio segment and an original audio signal corresponding to the audio segment.
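The two measures defined above may be sketched as follows. The use of the Pearson correlation coefficient and of an energy ratio expressed in decibels are assumptions, since the present disclosure states only that a correlation and a ratio are calculated; the function names are hypothetical.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def speech_distortion(separated, original):
    """Correlation between the separated signal and the reference signal,
    where the reference is the original minus the separated signal."""
    reference = [o - s for o, s in zip(original, separated)]
    return correlation(separated, reference)

def energy_ratio_db(separated, original):
    """Ratio of separated-signal energy to original-signal energy, in dB."""
    es = sum(s * s for s in separated)
    eo = sum(o * o for o in original)
    if es == 0 or eo == 0:
        return float("-inf")
    return 10.0 * math.log10(es / eo)

speech = [1.0, -1.0, 1.0, -1.0]
noise = [1.0, 1.0, -1.0, -1.0]
original = [s + n for s, n in zip(speech, noise)]

print(speech_distortion(speech, original))   # separated signal equals the speech
print(energy_ratio_db(speech, original))
```

In this toy case the separated signal is uncorrelated with the residual, so the correlation is zero; a separated signal that still shares content with the residual would give a correlation further from zero.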
In step S2204, the separation module performs feature separation on the encoded features based on the target audio segment, and obtains a feature mask corresponding to each sound source.
Speech separation is performed on the encoded features corresponding to the current audio segment based on the target audio segment determined for the previous audio block of the audio block where the current audio segment is located and the previous audio segment of the current audio segment, to obtain the feature mask corresponding to each sound source. For example, hidden layer state information of the target audio segment and the previous audio segment may be obtained. The hidden layer state information is obtained when the target audio segment and the previous audio segment undergo speech separation by the separation module, and includes at least one of short-term speech features, long-term speech features and context features of each sound source. The hidden layer state information of the target audio segment and the previous audio segment is fused to obtain fused hidden layer state information. Speech separation is performed on the encoded features corresponding to the current audio segment based on the fused hidden layer state information.
For example, the encoded features corresponding to the current audio segment include a plurality of audio units. The first separation module performs speech separation for each audio unit to obtain first separated features corresponding to the current audio segment, and the second separation module performs speech separation on the first separated features based on the fused hidden layer state information to obtain the feature mask of the current audio segment for each sound source.
In step S2205, the audio signal is decoded by the decoding module based on the feature mask to obtain a separated speech signal corresponding to each sound source.
In step S2206, network parameters of the encoding module, the audio segment search module, the separation module and the decoding module are adjusted based on the obtained speech signal and the corresponding separated speech signal.
For example, a loss function may be configured based on the obtained speech signal (the real signal) and the corresponding separated speech signal (the predicted signal), and the network parameters of each module may be adjusted by minimizing a loss calculated by the loss function.
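The idea of adjusting parameters by minimizing a loss between the real signal and the predicted signal can be illustrated with a toy example. The mean squared error is an assumed choice of loss function, since the present disclosure does not name one, and a single scalar gain stands in for the network parameters of all modules; the names and values are hypothetical.

```python
def mse(real, predicted):
    """Mean squared error between the real signal and the predicted signal."""
    return sum((r - p) ** 2 for r, p in zip(real, predicted)) / len(real)

def train_gain(mixture, real, gain=0.0, lr=0.1, steps=200):
    """Gradient descent on one scalar gain so that gain * mixture matches real."""
    n = len(real)
    for _ in range(steps):
        # d/d(gain) of the mean squared error of (gain * m - r)
        grad = sum(2 * (gain * m - r) * m for m, r in zip(mixture, real)) / n
        gain -= lr * grad
    return gain

mixture = [1.0, -2.0, 3.0, -1.0]
real = [0.5 * m for m in mixture]   # the "clean" target is half the mixture

gain = train_gain(mixture, real)
predicted = [gain * m for m in mixture]
print(gain, mse(real, predicted))
```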
The acquisition module 2301 may obtain an audio signal to be processed, wherein the audio signal includes a speech signal uttered by at least one sound source.
The encoding module 2302 may encode the audio signal to obtain encoded features of the audio signal.
The search module 2303 may determine a target audio segment of the audio signal, wherein the target audio segment may be determined based on speech quality of each audio segment divided from the audio signal.
The separation module 2304 may perform speech separation on the encoded features according to the target audio segment to obtain a feature mask corresponding to each sound source.
The decoding module 2305 may decode the audio signal based on the feature mask to obtain a separated speech signal corresponding to each sound source.
Alternatively, the search module 2303 may divide the audio signal into a plurality of audio blocks according to a first time period and divide each audio block into a plurality of audio segments according to a second time period, and determine a target audio segment for each audio block with respect to each sound source.
The separation module 2304 may perform speech separation on the encoded features corresponding to a current audio segment based on a target audio segment determined for a previous audio block of an audio block where the current audio segment is located and a previous audio segment of the current audio segment, to obtain the feature mask corresponding to each sound source.
Alternatively, for each sound source that has been separated, the search module 2303 may determine whether a current audio block belongs to a target audio block based on the speech quality of the current audio block, determine whether the speech quality of a current audio segment in the current audio block is higher than that of a previous audio segment of the current audio segment in a case where the current audio block belongs to the target audio block, determine whether the speech quality of the current audio segment is higher than that of a target audio segment determined for a previous audio block of the current audio block in a case where the speech quality of the current audio segment is higher than that of the previous audio segment, and determine a target audio segment corresponding to each sound source for the current audio block based on the comparison result.
Alternatively, the search module 2303 may determine the current audio segment as the target audio segment in a case where the speech quality of the current audio segment is higher than that of the target audio segment determined for the previous audio block, and determine the current audio segment as the target audio segment if a difference between the speech quality of the current audio segment and that of the target audio segment determined for the previous audio block is less than a preset threshold and a time interval between the current audio segment and the target audio segment determined for the previous audio block is greater than a time threshold, in a case where the speech quality of the current audio segment is lower than that of the target audio segment determined for the previous audio block.
Alternatively, for each sound source that has been separated, the search module 2303 may determine whether a current audio block belongs to a target audio block based on the speech quality of the current audio block, determine the speech quality of each audio segment in the current audio block in a case where the current audio block belongs to the target audio block, and select a first audio segment with the highest speech quality, determine whether the speech quality of the first audio segment is higher than that of a target audio segment determined for a previous audio block of the current audio block, and determine a target audio segment corresponding to each sound source for the current audio block based on the comparison result.
Alternatively, the speech quality includes at least one of speech distortion, signal-to-noise ratio, zero crossing rate and pitch quantity.
The speech distortion is determined by calculating correlation between a separated speech signal for an audio segment and a reference audio signal, wherein the reference audio signal is an audio signal obtained by subtracting the separated speech signal from an original audio signal corresponding to the audio segment. The signal-to-noise ratio is determined by calculating a ratio between a separated speech signal for an audio segment and an original audio signal corresponding to the audio segment.
Alternatively, the separation module 2304 may obtain hidden layer state information of the target audio segment and the previous audio segment, wherein the hidden layer state information is obtained when the target audio segment and the previous audio segment respectively undergo speech separation, and includes at least one of short-term speech features, long-term speech features and context features of each sound source. The separation module 2304 may fuse the hidden layer state information of the target audio segment and the previous audio segment to obtain fused hidden layer state information, and perform speech separation on the current audio segment based on the fused hidden layer state information.
Alternatively, the current audio segment includes a plurality of audio units, and the separation module 2304 may perform speech separation for each audio unit to obtain first separated features corresponding to the current audio segment, and perform speech separation on the first separated features based on the fused hidden layer state information to obtain the feature mask of the current audio segment for each sound source.
The speech processing process has been described in detail above with respect to
The acquisition unit 2401 may obtain a training sample. The training sample includes a speech signal uttered by at least one sound source in a noiseless environment and an audio signal composed of the speech signal and a noise signal.
The training unit 2402 may encode the audio signal by the encoding module to obtain encoded features of the audio signal, and determine a target audio segment in the audio signal by the audio segment search module, where the target audio segment includes speech features capable of identifying the corresponding sound source. The training unit 2402 may perform feature separation on the encoded features based on the target audio segment by the separation module and obtain a feature mask corresponding to each sound source, decode the audio signal by the decoding module based on the feature mask to obtain a separated speech signal corresponding to each sound source, and adjust network parameters of the encoding module, the audio segment search module, the separation module and the decoding module based on the obtained speech signal and the corresponding separated speech signal.
The model training process has been described in detail above with respect to
Those skilled in the art will appreciate that the configuration shown in
In the speech processing apparatus 2500 shown in
The processing component 2501 may include at least one processor, and the memory 2505 stores a set of computer-executable instructions that, when executed by the at least one processor, perform the speech processing method and the model training method according to the embodiments of the present disclosure. In addition, the processing component 2501 may execute the speech processing process or the model training process and the like. However, the above examples are only exemplary and the present disclosure is not limited thereto.
As an example, the speech processing apparatus 2500 may be a personal computer (PC), a tablet device, a personal digital assistant, a smart phone, or any other device capable of executing the above instruction set. Here, the speech processing apparatus 2500 does not have to be a single electronic device, but may also be any set of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The speech processing apparatus 2500 may also be a part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the speech processing apparatus 2500, the processing component 2501 may include a central processing unit (CPU), graphics processing unit (GPU), programmable logic device, special purpose processor system, microcontroller or microprocessor. By way of example and not limitation, the processing component 2501 may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processing component 2501 may execute instructions or code stored in the memory 2505, which may also store data. Instructions and data may also be sent and received over a network via a network interface 2503, which may employ any known transport protocol.
The memory 2505 may be integrated with the processor, e.g., a RAM or flash memory is arranged within an integrated circuit microprocessor or the like. Additionally, the memory 2505 may include a separate device such as an external disk drive, storage array, or any other storage device that may be used by a database system. The memory and the processor may be operatively coupled, or may communicate with each other, e.g., through I/O ports, network connections, etc., to enable the processor to read files stored in the memory.
An electronic device may be provided in accordance with the embodiments of the present disclosure.
The processor 2601 may include a central processing unit (CPU), an audio and video processor, a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor 2601 may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
As a storage medium, the memory 2602 may include an operating system (such as MAC operating system), a data storage module, a network communication module, a user interface module, a recommendation module and database.
The memory 2602 may be integrated with the processor 2601, e.g., a RAM or flash memory is arranged within an integrated circuit microprocessor or the like. Additionally, the memory 2602 may include a separate device such as an external disk drive, storage array, or any other storage device that may be used by a database system. The memory 2602 and the processor 2601 may be operatively coupled, or may communicate with each other, e.g., through I/O ports, network connections, etc., to enable the processor 2601 to read files stored in the memory 2602.
In addition, the electronic device 2600 may also include video displays (e.g. liquid crystal display) and user interaction interfaces (e.g. keyboard, mouse, touch input device, etc.). All components of the electronic device 2600 may be connected to each other via a bus and/or a network.
As an example, the electronic device 2600 may be a personal computer (PC), a tablet device, a personal digital assistant, a smart phone, or any other device capable of executing the above instruction set. Here, the electronic device 2600 does not have to be a single electronic device, but may also be any set of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The electronic device 2600 may also be a part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
It is understandable to those skilled in the art that the structure shown in
At least one of the above multiple modules may be implemented by an AI model. Functions associated with AI may be performed by a non-volatile memory, a volatile memory, and a processor.
The processor may include one or more processors. Here, the one or more processors may be general-purpose processors such as central processing units (CPUs) and application processors (APs), graphics-only processors such as graphics processing units (GPUs) and vision processing units (VPUs), and/or AI-specific processors such as neural processing units (NPUs).
The one or more processors control the processing of input data according to predefined operating rules or artificial intelligence (AI) models stored in the non-volatile memory and the volatile memory. The predefined operating rules or the artificial intelligence models may be provided through training or learning. Here, providing by learning means that a predefined operating rule or AI model with desired properties is formed by applying a learning algorithm to a plurality of learning data. Learning may be performed in the device that executes the AI according to an embodiment, and/or may be implemented by a separate server/device/system.
A learning algorithm is a method of using a plurality of learning data to train a predetermined target device (e.g., a robot) to cause, allow or control the target device to make a determination or prediction. Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
According to the present disclosure, in the speech processing method executed by the electronic device, the output speech after processing the target region may be obtained by taking the input speech as the input data of the artificial intelligence model.
An AI model may be obtained by training. Here, “obtained by training” refers to training a basic artificial intelligence model with a plurality of training data through a training algorithm, thereby obtaining a predefined operating rule or artificial intelligence model configured to perform required characteristics (or purposes).
As an example, an artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and neural network calculation is performed by a calculation between calculation results of a previous layer and the plurality of weight values. Examples of neural networks include, but are not limited to, convolutional neural networks (CNN), deep neural networks (DNN), recurrent neural networks (RNN), restricted Boltzmann machines (RBM), deep belief networks (DBN), bidirectional recurrent deep neural networks (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
According to an embodiment of the present disclosure, a computer-readable storage medium storing a computer program is also provided. The computer program, when executed by at least one processor, causes the at least one processor to perform the above speech processing method and the model training method according to the exemplary embodiments of the present disclosure. Examples of computer-readable storage media herein include: Read Only Memory (ROM), Random Access Programmable Read Only Memory (RAPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, Hard Disk Drive (HDD), Solid State Drive (SSD), card storage (such as multimedia cards, secure digital (SD) cards or extreme digital (XD) cards), magnetic tapes, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid state disks, and any other devices that are configured to store computer programs and any associated data, data files and data structures in a non-transitory manner and provide the computer programs and any associated data, data files and data structures to a processor or computer so that the processor or computer can execute the computer programs. The instructions or computer programs in the computer-readable storage medium described above may be executed in an environment deployed in a computer device. In addition, in one example, the computer programs and any associated data, data files, and data structures are distributed on a networked computer system, so that the computer programs and any associated data, data files, and data structures are stored, accessed and executed through one or more processors or computers in a distributed manner.
A computer program product may also be provided in accordance with an embodiment of the present disclosure. Instructions in the computer program product may be executed by a processor of a computer device to perform the speech processing method and the model training method.
After considering the specification and the practice of the present disclosure, those skilled in the art will readily conceive of other implementations of the present disclosure. This application is intended to cover any variation, use or adaptation of the present disclosure that follows the general principles of the present disclosure and includes the common knowledge or customary technical means in the field of technology not disclosed by the present disclosure. The specification and embodiments are deemed to be exemplary only, and the true scope and spirit of the present disclosure are indicated by the claims below.
It should be understood that the present disclosure is not limited to the precise structure already described above and shown in the attached drawings and is subject to various modifications and changes within its scope. The scope of the present disclosure is limited only by the attached claims.
Number | Date | Country | Kind |
---|---|---|---|
202211505381.X | Nov 2022 | CN | national |