This application claims the priority benefit of China application serial no. 202310963031.6, filed on Jul. 31, 2023. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The disclosure relates to the technical field of voice processing, and in particular to a multi-speaker overlapping voice detection method and a system thereof.
With the rapid development of information technology, artificial intelligence technology has also been improved in various fields. Recently, research on speaker logs (segmentation and clustering) has been widely conducted. However, most studies ignore the overlapping audio within the audio and do not consider its impact. In daily meetings or classrooms, there are many cases where the voices of multiple speakers overlap. Moreover, many studies have found that overlapping voice detection can greatly improve the accuracy of speaker logs.
The existing overlapping voice detection research usually only judges whether audio is overlapping or non-overlapping, but is unable to determine how many speakers speak simultaneously in the overlapping audio, which has certain limitations. Determining how many speakers speak simultaneously is very helpful for analyzing the extent of confusion in classrooms and meetings, for speaker logs, and for research on overlapping perception.
In view of the defects of the related art, the object of the disclosure is to provide a multi-speaker overlapping voice detection method and a system thereof, aiming to solve the problem that a conventional overlapping voice detection method cannot detect the number of speakers speaking simultaneously in an overlapping voice.
To achieve the above objectives, in a first aspect, the disclosure provides a multi-speaker overlapping voice detection method. The method includes the following steps. A voice to be detected is obtained, and silence is removed from the voice to be detected. A feature of the voice to be detected is extracted after the silence is removed, so as to obtain a voice feature of the voice to be detected. The voice feature is input into an overlapping voice detection model to obtain an overlapping speaker number corresponding to the voice to be detected output by the overlapping voice detection model. The overlapping speaker number represents a number of speakers speaking simultaneously in the voice to be detected. The overlapping voice detection model is obtained by supervised training based on a voice feature of a sample voice and a corresponding label of the overlapping speaker number, and the overlapping voice detection model extracts a speaker embedding from the voice feature and classifies the overlapping speaker number based on the extracted speaker embedding to obtain the overlapping speaker number of the voice to be detected.
In an optional example, the overlapping voice detection model includes an embedding extraction model and an overlapping speaker number classification model. The embedding extraction model is trained based on the following steps. A first classification model is trained based on the voice feature of the sample voice and the corresponding label of the overlapping speaker number, where the first classification model includes an embedding extraction part and a classification part. The embedding extraction part of the trained first classification model is used as the embedding extraction model. The overlapping speaker number classification model is trained based on the following steps. The voice feature of the sample voice is input into the embedding extraction model to obtain a sample speaker embedding. A second classification model is trained based on the sample speaker embedding and the corresponding label of the overlapping speaker number to obtain the overlapping speaker number classification model.
In an optional example, the first classification model sequentially includes five layers of time-delay neural networks, one statistical pooling layer, two layers of fully connected layers, and one activation function layer, wherein the five layers of time-delay neural networks, the one statistical pooling layer, and a first fully connected layer are the embedding extraction part, and the rest are the classification part.
In an optional example, the second classification model sequentially comprises four layers of one-dimensional convolutional neural networks, two layers of long short-term memory (LSTM) recurrent neural networks, one fully connected layer, and one activation function layer.
In an optional example, the sample voice includes a sample individual voice of a single speaker and a sample overlapping voice of multiple speakers. The sample individual voice and the sample overlapping voice are obtained based on the following steps. An individual voice of any speaker is divided into various sub-bands, and the silence is removed from the individual voice of the speaker based on an energy of each sub-band and a preset energy threshold. A data set is constructed based on the individual voice of each speaker after the silence is removed. An individual voice of a single speaker is selected from the data set as the sample individual voice. Individual voices of multiple speakers are randomly selected from the data set and superimposed to obtain the sample overlapping voice of the multiple speakers.
In a second aspect, a multi-speaker overlapping voice detection system includes a voice processing module, a feature extraction module, and an overlapping voice detection module. The voice processing module is configured to obtain a voice to be detected and remove silence from the voice to be detected. The feature extraction module is configured to extract a feature of the voice to be detected after the silence is removed, so as to obtain a voice feature of the voice to be detected. The overlapping voice detection module is configured to input the voice feature into an overlapping voice detection model to obtain an overlapping speaker number corresponding to the voice to be detected output by the overlapping voice detection model, where the overlapping speaker number represents a number of speakers speaking simultaneously in the voice to be detected. The overlapping voice detection model is obtained by supervised training based on a voice feature of a sample voice and a corresponding label of the overlapping speaker number; the overlapping voice detection model performs embedding extraction on the voice feature and classifies the overlapping speaker number based on an extracted speaker embedding to obtain the overlapping speaker number of the voice to be detected.
In an optional example, the overlapping voice detection model includes an embedding extraction model and an overlapping speaker number classification model. Correspondingly, the system further includes an embedding extraction training module and a classification training module. The embedding extraction training module is configured to train a first classification model based on the voice feature of the sample voice and a corresponding label of the overlapping speaker number, where the first classification model comprises an embedding extraction part and a classification part; and use the embedding extraction part of the trained first classification model as an embedding extraction model. The classification training module is configured to input the voice feature of the sample voice into the embedding extraction model to obtain a sample speaker embedding, and train a second classification model to obtain an overlapping speaker number classification model based on the sample speaker embedding and the corresponding label of the overlapping speaker number.
In an optional example, the first classification model of the embedding extraction training module sequentially includes five layers of time-delay neural networks, one statistical pooling layer, two layers of fully connected layers, and one activation function layer, wherein the five layers of time-delay neural networks, the one statistical pooling layer, and a first fully connected layer are the embedding extraction part, and the rest are the classification part.
In an optional example, the second classification model of the classification training module sequentially comprises four layers of one-dimensional convolutional neural networks, two layers of long short-term memory (LSTM) recurrent neural networks, one fully connected layer, and one activation function layer.
In an optional example, the sample voice includes a sample individual voice of a single speaker and a sample overlapping voice of multiple speakers. The system further includes a sample voice acquisition module configured to divide the individual voice of any speaker into various sub-bands, remove the silence from the individual voice of the speaker based on an energy of each sub-band and a preset energy threshold, construct a data set based on the individual voice of each speaker after silence removal, select an individual voice of a single speaker from the data set as the sample individual voice, and randomly select individual voices of multiple speakers from the data set and superimpose them to obtain the sample overlapping voice of the multiple speakers.
In order to make the purpose, technical solutions, and advantages of the disclosure comprehensible, the disclosure is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the disclosure, and are not used to limit the disclosure.
The purpose of the disclosure is to detect and identify overlapping audio through an overlapping voice detection algorithm based on an x-vector (that is, a speaker embedding). Whether in the study of speaker logs or the study of overlapping perception, detecting the overlapping audio can improve the prediction accuracy of a model. The detection and identification are based on the differences in acoustic features between overlapping audio and non-overlapping audio and the differences among overlapping speakers. Only when the detection is accurate enough can subsequent research be conducted, such as meeting-related research on the extent of voice and discussion, and classroom-related analysis of the conversations between teachers and students and the discussions among students.
In this regard, the disclosure provides a multi-speaker overlapping voice detection method.
Preferably, a voice activity detection algorithm may be adopted to remove the silence from the voice to be detected, where the voice to be detected is the voice for which the overlapping voice detection is required. After the silence is removed from the voice to be detected, Mel-frequency cepstral coefficient (MFCC) feature extraction may be adopted to extract the feature of the voice to be detected to obtain the voice feature of the voice to be detected. Since the MFCC feature best matches the auditory characteristics of the human ear, it may preserve the features of the speaker and help the subsequent overlapping voice detection model extract the speaker embedding.
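As a non-limiting illustration of this step, the feature extraction may be sketched in Python as follows, assuming the librosa library is used; the file name, sample rate, and trimming threshold are illustrative assumptions rather than requirements of the disclosure, and the energy-based voice activity detection described later may replace the simple trimming shown here.

```python
# Sketch only: librosa-based silence trimming and 24-dimensional MFCC
# extraction; parameters are illustrative assumptions.
import librosa

def extract_mfcc(wav_path, n_mfcc=24):
    """Load a voice segment and return its MFCC feature matrix."""
    y, sr = librosa.load(wav_path, sr=16000)   # assumed sample rate
    # Simple stand-in for silence removal; the disclosure's energy-based
    # VAD algorithm may be used here instead.
    y, _ = librosa.effects.trim(y, top_db=30)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                               # shape: (frames, n_mfcc)

features = extract_mfcc("voice_to_detect.wav")  # hypothetical file name
```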
It should be understood that the overlapping speaker number output by the overlapping voice detection model may be one or multiple, for example, two or five. The overlapping voice detection model may detect the overlapping speaker number corresponding to the voice, which not only realizes the detection of overlapping versus non-overlapping voice but also realizes the detection of the number of speakers speaking simultaneously in the overlapping voice. The model is thus of great help for the subsequent analysis of the extent of confusion in classrooms and meetings, speaker logs, and research on overlapping perception.
The method provided by the embodiment of the disclosure obtains the voice feature by removing the silence and extracting the feature of the voice to be detected, extracts the embedding of the voice feature by the overlapping voice detection model, and then classifies the overlapping speaker number to obtain the overlapping speaker number of the voice to be detected based on the extracted speaker embedding, which realizes the detection of the number of speakers speaking simultaneously in the overlapping voice. Furthermore, since the speaker embedding is extracted for the voice, the accuracy of the overlapping voice detection is improved, which has a certain positive impact on the accuracy of speaker logs and overlapping perception tasks.
Based on any of the above embodiments, the overlapping voice detection model includes an embedding extraction model and an overlapping speaker number classification model.
It should be noted that the disclosure trains the embedding extraction model first, and then trains the overlapping speaker number classification model based on the sample speaker embedding extracted by the embedding extraction model, thereby finally obtaining the overlapping voice detection model, which can further improve the accuracy of the overlapping voice detection.
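As a non-limiting illustration of this two-stage strategy, the following Python sketch (assuming PyTorch) shows how the classification part may be discarded after the first stage; the layer sizes are simplified placeholders, not the exact architecture of the disclosure.

```python
# Sketch only: two-stage training in which the embedding extraction part
# of the first classification model is reused for the second stage.
import torch.nn as nn

class FirstClassifier(nn.Module):
    def __init__(self, feat_dim=24, emb_dim=512, n_classes=5):
        super().__init__()
        # Embedding extraction part (placeholder for TDNN + pooling + FC).
        self.embed = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())
        # Classification part, used only during stage-1 training.
        self.head = nn.Linear(emb_dim, n_classes)

    def forward(self, x):
        return self.head(self.embed(x))

model = FirstClassifier()
# ... stage 1: supervised training with CE loss on overlap-count labels ...
embedder = model.embed            # keep only the embedding extraction part
for p in embedder.parameters():   # freeze it when producing sample
    p.requires_grad = False       # speaker embeddings for stage 2
```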
Based on any of the above embodiments, the first classification model sequentially includes five layers of time-delay neural networks, one statistical pooling layer, two layers of fully connected layers, and one activation function layer, where five layers of time-delay neural networks, one statistical pooling layer, and a first fully connected layer are the embedding extraction part, and the rest are the classification part.
Preferably, when the first classification model is trained, the loss function may adopt cross entropy (CE), and the model with the best training effect is taken as the finally trained first classification model, which yields an embedding extraction model with better performance. The time-delay neural network tdnn may include context acquisition, a fully connected layer, and a ReLU activation function.
It should be noted that the input of each layer of the time-delay neural network tdnn is composed of historical, current, and future features, thereby introducing timing information, which ensures that the speaker embedding can capture the temporal information of the audio and further improves the accuracy of the overlapping voice detection.
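As a non-limiting illustration, one such tdnn layer may be realized as a dilated one-dimensional convolution, a common practice in x-vector implementations; the context width and channel sizes below are assumptions.

```python
# Sketch only: a TDNN layer over the context {t-2, t, t+2} realized as a
# 1-D convolution with kernel size 3 and dilation 2.
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, kernel_size=3, dilation=2):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):              # x: (batch, feat_dim, frames)
        return self.relu(self.conv(x))

frames = torch.randn(8, 24, 200)       # batch of 24-dim MFCC sequences
out = TDNNLayer(24, 512)(frames)
print(out.shape)                       # (8, 512, 196): context trims the edges
```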
Based on any of the above embodiments, in order to further improve the accuracy of the overlapping voice detection, the second classification model may sequentially include four layers of one-dimensional convolutional neural networks, two layers of long short-term memory (LSTM) recurrent neural networks, one fully connected layer, and one activation function layer.
Preferably, the one-dimensional convolutional neural network sequentially includes a one-dimensional convolution, a normalization layer, a ReLU activation function, and a max pooling layer. The fully connected layer sequentially includes a linear layer, a ReLU activation function, and a linear layer. The activation function layer may adopt a softmax activation function.
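As a non-limiting illustration, the second classification model may be sketched as follows in Python (assuming PyTorch); the channel widths, hidden size, and input shape are assumptions, and in practice the input is the embedding produced by the embedding extraction model.

```python
# Sketch only: four 1-D CNN blocks, a two-layer LSTM, a fully connected
# layer, and a softmax activation layer, per the structure described above.
import torch
import torch.nn as nn

class OverlapCountClassifier(nn.Module):
    def __init__(self, in_dim=512, hidden=256, n_classes=5):
        super().__init__()
        def block(c_in, c_out):   # conv -> normalization -> ReLU -> max pool
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm1d(c_out), nn.ReLU(), nn.MaxPool1d(2))
        self.cnn = nn.Sequential(block(in_dim, 256), block(256, 256),
                                 block(256, 128), block(128, 128))
        self.lstm = nn.LSTM(128, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_classes))  # linear-ReLU-linear

    def forward(self, x):                   # x: (batch, in_dim, time)
        z = self.cnn(x).transpose(1, 2)     # (batch, time, 128)
        h, _ = self.lstm(z)
        return torch.softmax(self.fc(h[:, -1]), dim=-1)

probs = OverlapCountClassifier()(torch.randn(4, 512, 64))
print(probs.shape)                          # (4, 5): one to five speakers
```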
Based on any of the above embodiments, the sample voice includes a sample individual voice of a single speaker and a sample overlapping voice of multiple speakers. The sample individual voice and the sample overlapping voice are obtained based on the following steps. The individual voice of any speaker is divided into various sub-bands, and the silence is removed from the individual voice of the speaker based on the energy of each sub-band and a preset energy threshold. A data set is constructed based on the individual voice of each speaker after the silence is removed. The individual voice of a single speaker is selected from the data set as the sample individual voice. Individual voices of multiple speakers are randomly selected from the data set and superimposed to obtain the sample overlapping voice of the multiple speakers.
It should be noted that there is no public overlapping audio data set currently, and most data sets generated by researchers contain only the audio of a single speaker and the overlapping audio of two speakers, which cannot cover scenarios where more speakers overlap. Therefore, the disclosure first needs to generate the sample overlapping voice. Specifically, the energy-based voice activity detection algorithm is adopted first to detect the silence part in the individual voice data and remove the silence. That is, the collected individual voice of each speaker is divided into various sub-bands first, and the energy of each sub-band is calculated. According to the energy of each sub-band and the preset energy threshold, the silence is removed from the individual voice of each speaker, so that the individual voice of each speaker after silence removal is obtained, and the data set is thus constructed.
Subsequently, the individual voice of a single speaker may be directly selected from the data set as the sample individual voice, and individual voices of multiple speakers may be randomly selected from the data set and superimposed to obtain the sample overlapping voice of the multiple speakers, so that sample voices of different overlapping speaker numbers may be obtained for the subsequent multi-classification overlapping voice detection model training.
It should be understood that the label of the overlapping speaker number corresponding to the sample individual voice is one, and the label of the overlapping speaker number corresponding to the sample overlapping voice may be directly determined according to the number of individual voices selected for the superposition operation. For example, for a sample overlapping voice whose label of the overlapping speaker number is three, individual voices of three different speakers need to be selected from the data set for superposition. As a specific superposition method, two individual voices may be superimposed first, and then the third individual voice is superimposed on the result.
Based on any of the above embodiments, the disclosure is described in detail by taking a case where the maximum overlapping speaker number is five as an example. The main purpose of the disclosure is to perform feature extraction and identification on a segment of audio based on the x-vector, the convolutional neural network (CNN), and the long short-term memory (LSTM) recurrent neural network, and to divide the identification result into five categories: single (single speaker), two (two speakers), three (three speakers), four (four speakers), and five (five speakers), so as to provide the overlapping detection for the subsequent speaker logs.
The overlapping voice detection method of the disclosure is mainly divided into two parts: data processing of the audio data set and an XCLSnet five-classification model. The classification part has five steps: extracting the feature, training the embedding extraction model, extracting the embedding, training the five-classification model, and obtaining the results. The overall processing flow is as follows. First, the audio is input into the overlapping voice detection system, and the silence is removed from the audio by the voice activity detection algorithm. Afterwards, all the audio is randomly overlapped to generate overlapping samples of one to five speakers, the x-vector embedding is extracted, and the identification result is finally obtained according to the classification model. The result is divided into five categories: single speaker, two speakers, three speakers, four speakers, and five speakers, and finally the audio overlapping situation is analyzed.
Since there is no public overlapping audio data set currently, the overlapping audio needs to be generated. First, a voice activity detection (VAD) algorithm is adopted to detect the silent part in the audio and remove the silence. Afterwards, the data set is divided into two parts by random selection, and the overlapping operation is performed on the audio of each part respectively. As the data set is from a real environment, no noise superposition is performed.
The criterion of the energy-based voice activity detection is to detect the signal strength under the assumption that the voice energy is greater than the background noise energy. In this way, when the voice energy is greater than a certain threshold, the voice may be considered to exist. First, a broadband voice is divided into various sub-bands, the square sum of the energy of each frame is calculated from the spectrogram, and then the energy of each sub-band is calculated. A formula is as follows.

f(x) = true, if E < T; f(x) = false, if E ≥ T

In the formula, true means the detection result is silence, and false means the detection result is non-silence. That is, if the energy E is less than the energy threshold T, then f(x) is true and the sub-band is silence; otherwise f(x) is false and the sub-band is non-silence. The energy threshold T may be set to 15 according to experience. A calculation formula of the sub-band energy E is as follows.

E = V / L

In the calculation formula, V and L are the square sum of the energy of each frame and the length of the sub-band respectively.
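As a non-limiting illustration of the above decision rule, a Python sketch (assuming NumPy) follows; the framing scheme is an assumption, and the scale of the threshold T depends on the amplitude range of the audio samples.

```python
# Sketch only: energy-based silence decision f(x) with E = V / L and the
# empirical threshold T = 15 from the text.
import numpy as np

def is_silence(sub_band, threshold=15.0, frame_len=400):
    """Return True (silence) when the sub-band energy E is below T."""
    n_frames = max(1, len(sub_band) // frame_len)
    frames = np.array_split(sub_band, n_frames)
    v = sum(float(np.sum(f ** 2)) for f in frames)  # square sum per frame, summed
    e = v / len(sub_band)                           # E = V / L
    return e < threshold

quiet = np.random.randn(16000) * 0.001              # near-silent example signal
print(is_silence(quiet))                            # True
```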
The silence and non-silence parts in the audio are determined according to the aforementioned algorithm, and the overall audio is then detected and determined sub-band by sub-band, so as to achieve the effect of distinguishing silence from non-silence.
For random overlapping, the audio of single speakers is first randomly selected from the data set, and then an audio 1 is superimposed on an audio 2 by the overlay( ) method of AudioSegment to ensure that the entire audio is full of overlapping voice.
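As a non-limiting illustration, the superposition may be written with the pydub library as follows; the file names are hypothetical.

```python
# Sketch only: randomly pick two single-speaker files and superimpose one
# on the other with AudioSegment.overlay() to form a two-speaker sample.
import random
from pydub import AudioSegment

paths = ["spk1.wav", "spk2.wav", "spk3.wav"]   # hypothetical VAD-cleaned files
first, second = random.sample(paths, 2)        # two distinct speakers
audio_1 = AudioSegment.from_wav(first)
audio_2 = AudioSegment.from_wav(second)
overlapped = audio_2.overlay(audio_1)          # overlap spans the shorter audio
overlapped.export("overlap_2spk.wav", format="wav")  # label: two speakers
```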
The overlapping audio data set generated based on a public data set voxceleb1 is shown in the following table.
The XCLSnet model is mainly divided into two parts: the embedding extraction and the five-classification.
After the audio data is processed, the feature is extracted for each audio segment. First, the MFCC, which best matches the auditory feature of the human ear, is used in the feature extraction to obtain a 24-dimensional feature vector that preserves the features of the speaker for the subsequent x-vector embedding training.
The x-vector model (that is, the aforementioned first classification model) includes five layers of time-delay neural networks tdnn, one statistical pooling layer, two layers of fully connected layers, and a softmax activation function. The loss function is CE. A formula is as follows.

y = softmax(F(x_1, x_2, . . . , x_T))

In the formula, x_t represents the MFCC feature vector of the voice signal at time t, F(·) represents the mapping function composed of the five layers of time-delay neural networks tdnn, the one statistical pooling layer, and one fully connected layer of the x-vector model, and y represents the output vector of the model.
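As a non-limiting illustration of the statistical pooling layer inside F(·), frame-level outputs may be reduced to a segment-level vector by concatenating their mean and standard deviation, as in the common x-vector recipe; the dimensions below are assumptions.

```python
# Sketch only: statistical pooling that turns frame-level TDNN outputs
# into one segment-level vector (mean and standard deviation concatenated).
import torch

def stats_pooling(frame_feats):            # (batch, channels, frames)
    mean = frame_feats.mean(dim=2)
    std = frame_feats.std(dim=2)
    return torch.cat([mean, std], dim=1)   # (batch, 2 * channels)

h = torch.randn(8, 1500, 300)              # hypothetical TDNN frame outputs
print(stats_pooling(h).shape)              # (8, 3000): input to the FC layers
```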
A structure of the x-vector model is shown in the following table. Each frame layer has a context feature.
In Table 2, the frame layer is a time-delay neural network tdnn, including context acquisition, a fully connected layer, and a ReLU activation function.
The extracted x-vector embedding is input into the five-classification model (that is, the aforementioned second classification model), which is composed of the one-dimensional convolutional neural networks and the LSTM layers, and whose classification output may be expressed as follows.

ŷ = σ(W·h_T + b)

In the formula, h_T represents the last hidden state of the LSTM, W and b are the weight and bias of the classification model respectively, and σ is the softmax activation function. The output of the one-dimensional convolution and the LSTM layer may be expressed as follows.

z_i = Σ_{j=1}^{k} w_j·x_{i+j−1}

h_t = LSTM(z_t, h_{t−1})

In the formulas, h_t is the hidden state of the LSTM layer at time step t, z is the vector output by the last convolutional layer, w_j is the weight of the convolution kernel, k is the size of the convolution kernel, and z_i is the output of the convolutional layer at position i.
It should be noted that the disclosure processes the audio data by the VAD algorithm and random overlapping to generate a five-classification overlapping voice data set based on the voxceleb1 public data set, which is convenient for subsequent research on overlapping voice detection with more categories. The disclosure proposes the overlapping voice detection model XCLSnet, which removes the silence from the audio by the VAD algorithm, imports the audio into the XCLSnet model, adopts the x-vector embedding extraction model to extract the x-vector embedding from the audio, and then inputs the embedding into the CNN and the LSTM network to obtain a classification result of the number of overlapping speakers in the audio. The classification result has a certain positive impact on the accuracy of speaker logs and overlapping perception tasks.
Based on any of the above embodiments, the model is trained on a data set generated by the random overlapping of voxceleb1, with a learning rate of 0.0001, 100 epochs, and a batch size of 32.
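As a non-limiting illustration, these hyper-parameters correspond to the following Python training loop (assuming PyTorch); the model and data loader are placeholders for the XCLSnet model and the randomly overlapped data set, and the optimizer choice is an assumption.

```python
# Sketch only: training configuration with lr = 0.0001, 100 epochs, and
# batch size 32; Adam is an assumed optimizer choice.
import torch
import torch.nn as nn

model = nn.Linear(3000, 5)                 # placeholder for the XCLSnet model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()          # CE loss, as in the disclosure
train_loader = [(torch.randn(32, 3000),    # placeholder batches of embeddings
                 torch.randint(0, 5, (32,)))]

for epoch in range(100):
    for feats, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(feats), labels)
        loss.backward()
        optimizer.step()
```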
In Table 3, 1v1 represents the two-classification overlapping voice detection task of distinguishing overlapping and non-overlapping, and 1v5 represents the five-classification overlapping voice detection task. As can be seen from the table, the proposed XCLSnet model is 3% better than the Bcnn model for the 1v1 task and 27% better than the Bcnn model for the 1v5 multi-speaker overlapping detection task. The reason is that the x-vector embedding is extracted from the audio, and the LSTM network is added to incorporate the timing information.
Based on any of the above embodiments, the disclosure provides a multi-speaker overlapping voice detection system.
It should be understood that the detailed functional implementation of each of the aforementioned modules may be found in the introduction of the aforementioned method embodiment, and is not repeated herein.
In addition, an embodiment of the disclosure provides another multi-speaker overlapping voice detection device, which includes a memory and a processor. The memory is configured to store a computer program. The processor is configured to implement the method in the aforementioned embodiment when executing the computer program.
In addition, the disclosure further provides a computer-readable storage medium, on which a computer program is stored. When the computer program is executed by the processor, the method in the aforementioned embodiment is implemented.
Based on the method in the aforementioned embodiment, an embodiment of the disclosure provides a computer program product. When the computer program product runs on the processor, the processor executes the method in the aforementioned embodiment.
Based on the above, the disclosure can be implemented as described in the aforementioned embodiments and examples.
It will be easily understood by people skilled in the art that the above description is only a preferred embodiment of the disclosure and is not intended to limit the disclosure. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the disclosure should be included in the protection scope of the disclosure.