Speech recognition may be used, for example, to automatically transcribe audio signals or recognize speech-based input commands. Such techniques typically assume that a single speaker is speaking at any given time. However, in many real-life scenarios, such as in a meeting or conversation, several speakers may occasionally speak simultaneously.
Conventional speech recognition techniques are unable to satisfactorily perform speech recognition on an audio signal which includes simultaneous, or overlapping, speech. Accordingly, it is desirable to separate the overlapping speech from the audio signal in order to perform speech recognition thereon. This problem is known as speech separation.
According to one type of speech separation, a neural network is trained to receive a mixed speech signal and produce two or more output signals, where each output signal corresponds to only one speaker. For input frames of the mixed speech signal which include no overlapping speech, only one output signal contains speech and the other outputs contain silence. Unfortunately, the order of the speakers in the output signals is arbitrary, necessitating further processing to match the output signals to the speaker identities. Moreover, such approaches do not utilize speaker-specific information, which limits the level of achievable performance.
In another type of speech separation, a neural network receives a mixed speech signal and audio signals of one or more sentences uttered by a target speaker whose speech is to be extracted, and extracts a signal of the target speaker's speech from the mixed speech signal. Accordingly, the identity of the target speaker and the audio signals uttered by the target speaker must be available in advance. Current implementations of this approach are also performance-limited due to a lack of consideration of the speech of competing speakers within the mixed speech signal and/or the coarse usage of the audio signals of one or more sentences uttered by the target speaker. In the latter regard, the uttered audio signals are typically condensed into a single multi-dimensional vector which is used to provide a global bias for extraction of the target speaker's speech from each frame of the input mixed speech signal.
What is needed is a system for improved neural network-based speech separation.
The following description is provided to enable any person in the art to make and use the described embodiments. Various modifications, however, will remain readily apparent to those of ordinary skill in the art.
Generally, some embodiments perform speech extraction using time-varying and context-dependent target speaker biasing. Contrary to the conventional systems mentioned above, embodiments do not reduce the target speaker bias into a single global vector which is used to guide the extraction of the target speaker's speech from all frames of the input mixed speech signal. Rather, a respective target bias vector may be determined for, and based on, each frame of an input mixed signal in order to extract the target speaker's speech from the respective frames.
Some embodiments may also or alternatively utilize time-varying and context-dependent masks of competing speakers to extract the speech of a target speaker. Generally, a trained network may perform better if provided with information relating to speech to be extracted and information relating to speech which should not be extracted, as opposed to providing only one of these inputs. The vectors/masks/bias signals of the target and/or competing speakers may be determined based on a local similarity between the target/competing profile utterances and the input mixed speech signal, thereby allowing each short time segment of the input mixed speech signal to be aligned and accurately compared with the bias signals.
Embodiments may be compatible with voice-enabled home devices and office meeting transcriptions, in which a possible speaker list and speaker profile signals may be available a priori.
The diagrams described herein do not imply a fixed order to the illustrated methods, and embodiments may be practiced in any order that is practicable. Moreover, any of the methods described herein may be performed by hardware, software, or any combination of these approaches. For example, a computer-readable storage medium may store thereon instructions which when executed by a machine result in performance of methods according to any of the embodiments described herein.
As shown, vector generation component 110 generates multi-dimensional vector 120 based on a signal representing mixed speech. The signal may include speech of two or more speakers, with at least two speakers speaking simultaneously for at least a portion of the duration of the signal. For simplicity, it will be assumed that vector 120 represents a particular time duration, or frame, of the mixed speech signal.
Vector 120 may comprise a respective value for each of many dimensions which represent characteristics of a portion of the mixed speech signal. Vector 120 may be generated from the mixed speech signal using any system that is or becomes known. Vector generation component 110 may, in some embodiments, comprise an artificial neural network which has been trained to generate multi-dimensional vectors representing portions of mixed speech signals. An architecture and training of such a network according to some embodiments are described below.
Profile 130 includes a plurality of vectors, each of which represents speech of a target speaker. Each vector of profile 130 may comprise a respective value for each of many dimensions which represent characteristics of a portion of a target speaker's speech. For example, each vector of profile 130 may represent a different portion of any number of recorded sentences of the target speaker's speech. The vectors of profile 130 may be generated using any system that is or becomes known, including a trained artificial neural network as will be described below.
Similarity weighting unit 140 generates bias vector 150 based on vector 120 and the vectors of profile 130. Generally, similarity weighting unit 140 determines a respective similarity between vector 120 and each vector of profile 130. The vectors of profile 130 are then combined based on their determined similarities to vector 120 in order to generate bias vector 150. Determination of the similarities and combination of the vectors of profile 130 based on the similarities according to some embodiments will be described in detail below.
Bias vector 150 therefore represents portions of the target profile vectors which are similar to the mixed speech signal. Bias vector 150, vector 120 and the mixed speech signal are then input to extraction unit 160 to extract a frame of the target speaker's speech signal from the frame of the mixed speech signal represented by vector 120. Extraction unit 160 may also comprise a trained neural network according to some embodiments. In some embodiments, extraction unit 160 generates a Time-Frequency (TF) mask associated with the target speaker and applies the mask to the frame of the mixed speech signal to generate a frame of the target speaker's speech signal.
The foregoing process repeats for a next frame of the mixed speech signal. However, assuming that vector 120 generated based on the next frame differs from the previously-generated vector 120, similarity weighting unit 140 will determine different similarity weightings for each vector of profile 130. Consequently, similarity weighting unit 140 will also generate a different bias vector 150. Bias vector 150 is therefore time-varying and context-dependent. Accordingly, embodiments may provide better information to extraction unit 160 for extraction of a frame of the target speaker's speech than in some prior systems, in which an extraction unit uses a same bias vector to guide extraction of the target speaker's speech from each frame of the mixed speech signal.
Initially, at S210, a plurality of multi-dimensional vectors are determined. Each of the plurality of multi-dimensional vectors represents a respective frame of a target speaker's speech. According to some embodiments, each of the plurality of multi-dimensional vectors may represent a different portion of any number of recorded sentences of the target speaker's speech. S210 may be performed well in advance of the remaining steps of process 200, in a pre-processing phase in which a plurality of multi-dimensional vectors are determined for each of many speakers.
A multi-dimensional vector representing a frame of a speech signal is determined at S220. The speech signal of S220 includes speech of two or more speakers. The speech signal may, for example, comprise an audio recording of a meeting. The multi-dimensional vector may represent features of the frame of the speech signal.
Next, at S230, a similarity is determined between the vector determined at S220 and each vector determined at S210. According to some embodiments, the multi-dimensional vector determined at S220 may be determined using a same network or algorithm as used to determine each multi-dimensional vector at S210. Such determinations may facilitate the similarity determinations at S230.
A weighted vector representing speech of the target speaker is determined at S240. The weighted vector is determined based on the vectors determined at S210 and on the similarities determined for each of these vectors at S230. For example, it is assumed that a first one of the plurality of vectors determined at S210 is determined to be very similar to the vector determined at S220, and a second one of the plurality of vectors determined at S210 is determined to be less similar to the vector determined at S220. Accordingly, the first vector will be weighted more heavily than the second vector in the determination of the weighted vector at S240. Consequently, the weighted vector will be more similar to the first vector than to the second vector.
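The two-vector example above can be worked through numerically. The following is a minimal sketch only: the use of an inner product as the similarity measure, the softmax normalization of the similarities, and the names `weighted_profile_vector`, `frame_vec` and `profile_vecs` are assumptions made for illustration and are not required by S230 or S240.

```python
import numpy as np

def weighted_profile_vector(frame_vec, profile_vecs):
    """Combine profile vectors, weighting each by its similarity to frame_vec.

    frame_vec:    shape (D,)   vector for the current mixed-speech frame (S220)
    profile_vecs: shape (N, D) vectors for the target speaker's frames (S210)
    """
    # S230: one similarity per profile vector (inner product, assumed here)
    sims = profile_vecs @ frame_vec
    # Turn the similarities into positive weights that sum to 1 (softmax, assumed)
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()
    # S240: weighted combination of the profile vectors
    return weights, weights @ profile_vecs

frame_vec = np.array([1.0, 0.0])
profile_vecs = np.array([[0.9, 0.1],    # first vector: very similar to frame_vec
                         [0.0, 1.0]])   # second vector: less similar
weights, weighted = weighted_profile_vector(frame_vec, profile_vecs)
```

In this toy case the first profile vector receives the larger weight, so the resulting weighted vector lies closer to the first vector than to the second.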
At S250, speech of the target speaker is extracted from the frame of speech of the two or more speakers based on the weighted vector. According to some embodiments, the weighted vector and the vector determined at S220 are used at S250 to extract a frame of the target speaker's speech.
It is then determined at S260 whether additional frames of speech of the two or more speakers remain to be processed. If so, flow returns to S220 to determine a multi-dimensional vector representing the next frame of speech of the two or more speakers. New similarities are determined at S230 between the next frame and each of the plurality of vectors determined at S210. These new similarities are used in conjunction with each of the plurality of vectors to determine a new weighted vector at S240 which is used at S250 to extract a frame of the target speaker's speech from the next frame of speech.
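The control flow of S220 through S260 may be sketched as a simple per-frame loop. This is an illustrative outline under stated assumptions: `embed` and `extract` are hypothetical caller-supplied stand-ins for whatever vector generation and extraction mechanisms an embodiment uses, and the inner-product similarity with softmax weighting is one possible choice.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def extract_frames(mixed_frames, profile_vecs, embed, extract):
    """Per-frame loop of S220-S260; embed and extract are hypothetical stand-ins."""
    outputs = []
    for frame in mixed_frames:                   # S260: continue while frames remain
        y = embed(frame)                         # S220: vector for this frame
        w = softmax(profile_vecs @ y)            # S230: similarity to each profile vector
        bias = w @ profile_vecs                  # S240: weighted vector for this frame
        outputs.append(extract(frame, y, bias))  # S250: extract target speech
    return outputs

# Toy stand-ins for illustration only (not trained networks)
embed = lambda frame: frame[:2]
extract = lambda frame, y, bias: frame * 0.5
rng = np.random.default_rng(0)
profile_vecs = rng.standard_normal((5, 2))   # S210: determined once, in advance
mixed_frames = rng.standard_normal((3, 4))
out = extract_frames(mixed_frames, profile_vecs, embed, extract)
```

Note that the profile vectors are computed once outside the loop, while the similarities and the weighted vector are recomputed for every frame.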
According to the extraction depicted in
The log-scaled magnitude spectra of the frame of mixed speech and the frames of target speech are input to embedders 315 and 320, respectively. Embodiments may employ any suitable audio representations for input to embedders 315 and 320. Based on the input, embedder 315 and embedder 320 respectively generate embedding vector 325 and embedding vectors 330. As is known in the art, the embedding vectors may be projected from the hidden activations of a last layer of a trained neural network implementing embedders 315 and 320. Each embedding vector may consist of 512 dimensions, but embodiments are not limited to the particular dimensionalities noted herein.
As described with respect to similarity weighting unit 140 of
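Attention network 335 computes context-dependent bias vectors $B \in \mathbb{R}^{D \times T}$ as follows:

$$d_{t,i} = Y_t \cdot X_i$$

$$w_{t,i} = \frac{\exp(d_{t,i})}{\sum_{j=1}^{T} \exp(d_{t,j})}$$

$$B_t = \sum_{i=1}^{T} w_{t,i} X_i$$

where $d_{t,i}$ is the inner product of $Y_t$ and $X_i$ and measures their similarity, the weight $w_{t,i}$ is the softmax of $d_{t,i}$ over $i \in [1, T]$, and the bias vector at frame $t$ is $B_t$. $B_t$ is therefore a weighted sum of embedding vectors 330 of the target speaker's profile data.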
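The inner product, softmax and weighted sum described above can be computed for all frames at once. The sketch below is one possible vectorized form, not a definitive implementation: the function name `attention_bias`, the column-per-frame array layout, and the use of a separate symbol `N` for the number of profile frames are assumptions made for illustration.

```python
import numpy as np

def attention_bias(Y, X):
    """Y: (D, T) mixed-speech embeddings; X: (D, N) profile embeddings.

    Returns B of shape (D, T) and the weights w of shape (T, N).
    """
    d = Y.T @ X                               # d[t, i] = inner product of Y_t and X_i
    d = d - d.max(axis=1, keepdims=True)      # stabilize the exponentials
    w = np.exp(d)
    w /= w.sum(axis=1, keepdims=True)         # w[t, i]: softmax over i for each frame t
    B = X @ w.T                               # B[:, t] = sum_i w[t, i] * X[:, i]
    return B, w

rng = np.random.default_rng(0)
D, T, N = 8, 4, 6                             # small sizes chosen for the example
Y = rng.standard_normal((D, T))
X = rng.standard_normal((D, N))
B, w = attention_bias(Y, X)
```

Each column of `B` is a different weighted sum of the same profile embeddings, which is what makes the bias time-varying.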
Mask extraction unit 345 receives vectors 325 and 340 and the mixed speech signal and generates a TF mask of the target speaker's speech based thereon. Mask extraction unit 345 may comprise a trained neural network as will be described below. Generation unit 350 applies the TF mask to the subject frame of the mixed speech signal as is known in the art in order to output an extracted frame of the target speaker's speech.
Generally, model training platform 510 operates to input training data to system 300, determine whether the resulting output of system 300 is sufficiently accurate with respect to ground truth data, modify system 300 if the resulting output of system 300 is not sufficiently accurate, and repeat the process until the resulting output of system 300 is sufficiently accurate.
According to some embodiments, the training data is determined based on speech signals stored in datastore 520. Datastore 520 associates each of a plurality of speakers with one or more pre-captured utterances. The utterances may be audio signals in any format suitable for input to system 300.
In one non-exhaustive example illustrated in
A target profile utterance of one of the randomly-selected speakers is also randomly sampled. The target profile utterance (e.g., Spkr1, Smpl2) is longer than a minimum length (e.g., 10 s) and is different from the utterance used for generating the mixed speech. The mixed speech sample and the target profile utterance comprise a single training sample. A training set may consist of thousands of such training samples, each generated as described above.
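Assembly of a single training sample might look like the following. This is a hedged sketch: the dictionary layout of `datastore`, the choice of exactly two speakers, mixing by summing waveforms over the shorter length, and the 16 kHz sampling rate are all assumptions for illustration; only the minimum profile length and the requirement that the profile utterance differ from the mixed utterance come from the description above.

```python
import random
import numpy as np

SAMPLE_RATE = 16000                      # assumed sampling rate
MIN_PROFILE_SAMPLES = 10 * SAMPLE_RATE   # minimum profile length (e.g., 10 s)

def make_training_sample(datastore, rng):
    """datastore: dict mapping speaker id -> list of 1-D waveforms (hypothetical layout)."""
    target, competing = rng.sample(sorted(datastore), 2)
    tgt_idx = rng.randrange(len(datastore[target]))
    tgt_utt = datastore[target][tgt_idx]
    cmp_utt = rng.choice(datastore[competing])
    # Mix by summing over the shorter length (an assumed mixing rule)
    n = min(len(tgt_utt), len(cmp_utt))
    mixed = tgt_utt[:n] + cmp_utt[:n]
    # Profile utterance: same speaker, a different utterance, longer than the minimum
    candidates = [u for i, u in enumerate(datastore[target])
                  if i != tgt_idx and len(u) > MIN_PROFILE_SAMPLES]
    profile = rng.choice(candidates)     # raises IndexError if no candidate qualifies
    return mixed, profile, tgt_utt[:n]

noise = np.random.default_rng(0)
datastore = {
    "Spkr1": [noise.standard_normal(11 * SAMPLE_RATE) for _ in range(3)],
    "Spkr2": [noise.standard_normal(11 * SAMPLE_RATE) for _ in range(3)],
}
mixed, profile, reference = make_training_sample(datastore, random.Random(0))
```

The third returned value is the target speaker's source utterance, which would serve as ground truth when a loss is computed.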
As shown, the mixed speech sample and target profile utterance of a training sample are input to model training platform 510 for input to system 300. Loss component 530 then determines a difference between the output of system 300 and the utterance of the target speaker which was used to generate the mixed speech sample (e.g., (Spkr1, Smpl1)). The above is repeated for each training sample of the training set, in order to determine an overall loss. Model training platform 510 then modifies the neural networks of system 300 based on the overall loss as is known in the art. The foregoing process may be considered a single training step, and training may include the execution of thousands of successive steps.
According to some embodiments, the input mixed speech sample and the input target speaker profile each comprise a 257 dimension log-scaled magnitude spectrum. Each analysis frame may be 32 ms long (e.g., 512 samples for a sampling rate of 16 kHz) and successive frames may be shifted by 16 ms. The FFT length is 512.
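The stated framing parameters determine the feature dimension: a 512-point FFT of a real signal yields 512/2 + 1 = 257 frequency bins. The sketch below illustrates this with NumPy. The Hann window, the small constant added before the logarithm, and the function name `log_magnitude_spectra` are assumptions for illustration and are not specified above.

```python
import numpy as np

FRAME_LEN = 512     # 32 ms at 16 kHz
FRAME_SHIFT = 256   # 16 ms at 16 kHz
FFT_LEN = 512

def log_magnitude_spectra(signal, eps=1e-8):
    """Return an array of shape (num_frames, 257) of log-scaled magnitude spectra."""
    num_frames = 1 + (len(signal) - FRAME_LEN) // FRAME_SHIFT
    window = np.hanning(FRAME_LEN)            # window choice is an assumption
    frames = np.stack([signal[i * FRAME_SHIFT : i * FRAME_SHIFT + FRAME_LEN] * window
                       for i in range(num_frames)])
    # rfft keeps only the non-negative frequencies: FFT_LEN // 2 + 1 = 257 bins
    magnitude = np.abs(np.fft.rfft(frames, n=FFT_LEN, axis=1))
    return np.log(magnitude + eps)            # eps avoids log(0); an assumption

signal = np.random.default_rng(0).standard_normal(16000)   # 1 s of noise at 16 kHz
features = log_magnitude_spectra(signal)
```

The function assumes the signal is at least one frame long; shorter inputs would need padding.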
In some embodiments, the same embedding network is used as embedder 320 and embedder 315 to generate the embedding vectors for the target speaker profile and the mixed speech signal. The embedding network may include two bidirectional Long Short-Term Memory (BLSTM) layers, with 512 cells in each direction of each layer. As is known in the art, the last layer's hidden activations are projected to 512 dimension embedding vectors.
An extraction network implementing mask extraction unit 345 may contain two BLSTM layers, with 512 cells in each direction of each layer. The input dimension of mask extraction unit 345 is 257+512*2, and the output comprises TF masks of 257 dimensions.
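The input dimension of 257+512*2 = 1281 is consistent with joining, for each frame, the 257-dimension log-scaled magnitude spectrum, the 512-dimension embedding vector and the 512-dimension bias vector. The following sketch assumes simple concatenation in that order; the ordering and the variable names are illustrative assumptions.

```python
import numpy as np

SPEC_DIM, EMB_DIM = 257, 512

log_spectrum = np.zeros(SPEC_DIM)   # log-scaled magnitude spectrum of the mixed frame
embedding = np.zeros(EMB_DIM)       # embedding vector of the mixed frame
bias = np.zeros(EMB_DIM)            # bias vector determined for the same frame

# Assumed per-frame input layout for the extraction network: 257 + 512 + 512 values
extractor_input = np.concatenate([log_spectrum, embedding, bias])
```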
A neural network (e.g., deep learning, deep convolutional, or recurrent) according to some embodiments comprises a series of “neurons,” such as Long Short-Term Memory (LSTM) nodes, arranged into a network. A neuron is an architecture used in data processing and artificial intelligence, particularly machine learning, that includes memory that may determine when to “remember” and when to “forget” values held in that memory based on the weights of inputs provided to the given neuron. Each of the neurons used herein is configured to accept a predefined number of inputs from other neurons in the network to provide relational and sub-relational outputs for the content of the frames being analyzed. Individual neurons may be chained together and/or organized into tree structures in various configurations of neural networks to provide interactions and relationship learning modeling for how the frames in an utterance are related to one another.
For example, an LSTM serving as a neuron includes several gates to handle input vectors, a memory cell, and an output vector. The input gate and output gate control the information flowing into and out of the memory cell, respectively, whereas forget gates optionally remove information from the memory cell based on the inputs from linked cells earlier in the neural network. Weights and bias vectors for the various gates are adjusted over the course of a training phase, and once the training phase is complete, those weights and biases are finalized for normal operation. Neurons and neural networks may be constructed programmatically (e.g., via software instructions) or via specialized hardware linking each neuron to form the neural network.
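The gate behavior described above corresponds to the commonly used LSTM cell update. The sketch below shows one forward step in NumPy; the stacked parameter layout (`W`, `U`, `b`), the gate ordering and the function names are assumptions chosen for illustration, and the parameters here are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward step of a standard LSTM cell.

    x: (I,) input vector; h_prev, c_prev: (H,) previous output and memory cell.
    W: (4H, I), U: (4H, H), b: (4H,) stacked weights and biases for the gates.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information enters the cell
    f = sigmoid(z[H:2 * H])        # forget gate: how much of the old cell is kept
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much of the cell is exposed
    g = np.tanh(z[3 * H:4 * H])    # candidate values to write into the cell
    c = f * c_prev + i * g         # updated memory cell
    h = o * np.tanh(c)             # output vector
    return h, c

rng = np.random.default_rng(0)
I, H = 3, 4
W = rng.standard_normal((4 * H, I))
U = rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(I), np.zeros(H), np.zeros(H), W, U, b)
```

A bidirectional layer would run one such recurrence forward in time and another backward, then join their outputs.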
According to the foregoing embodiments, the TF masks are multiplied by the linear magnitude of the mixed speech signal to generate the extracted magnitude of the target speaker's speech. This extracted magnitude is compared against the target speaker's source speech used to generate the mixed speech signal in order to determine the training loss. According to some embodiments, the networks are trained with the Adam optimizer jointly to minimize the mean squared error between the extracted magnitude of the target speaker's speech and the magnitude of the target speaker's source speech.
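The mask application and the mean squared error can be illustrated with a few numbers. This sketch covers only the element-wise multiplication and the loss computation; the optimizer update is omitted, and the function names and toy values are assumptions for illustration.

```python
import numpy as np

def extract_magnitude(tf_mask, mixed_magnitude):
    """Apply a TF mask element-wise to the linear magnitude of the mixed speech."""
    return tf_mask * mixed_magnitude

def mse_loss(extracted, reference):
    """Mean squared error between extracted and source magnitudes."""
    return float(np.mean((extracted - reference) ** 2))

mixed = np.array([[2.0, 4.0, 1.0]])     # toy mixed-speech magnitudes (1 frame, 3 bins)
source = np.array([[1.0, 1.0, 1.0]])    # toy target-speaker source magnitudes
mask = np.array([[0.5, 0.25, 1.0]])     # a mask that recovers the source exactly

loss = mse_loss(extract_magnitude(mask, mixed), source)
loss_no_mask = mse_loss(extract_magnitude(np.ones_like(mixed), mixed), source)
```

Here the chosen mask yields zero loss, while passing the mixture through unchanged yields a larger loss, which is the signal training would use to improve the mask.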
It may be difficult to differentiate the target and competing speakers, particularly when a competing speaker's voice is similar to that of the target speaker. In these situations and others, it may be beneficial to use a competing speaker's profile to assist the extraction process. By providing the extraction network with information representing the target speech and speech which should not be extracted, the extraction network may be better able to extract the target speech.
Similar to the description of
Profile 630 includes a plurality of multi-dimensional vectors, each of which represents a frame of speech of the target speaker. Profile 635 also includes a plurality of multi-dimensional vectors, each of which represents a frame of speech of the known competing speaker.
Similarity weighting unit 640 generates bias vector 650 based on vector 620 and the vectors of profile 630 as described above. Moreover, similarity weighting unit 645 generates bias vector 655 based on vector 620 and the vectors of profile 635 as described above. Accordingly, bias vector 650 represents portions of target profile vectors 630 which are similar to the subject frame of the mixed speech signal and bias vector 655 represents portions of competing speaker profile vectors 635 which are similar to the subject frame of the mixed speech signal.
Vector 620, bias vector 650, bias vector 655 and the mixed speech signal are then input to extraction unit 660 to extract a frame of the target speaker's speech signal from the subject frame of the mixed speech signal. As described above, extraction unit 660 generates a TF mask associated with the target speaker and applies the mask to the frame of the mixed speech signal to generate a frame of the target speaker's speech signal. In some embodiments, extraction unit 660 may also generate a TF mask associated with the competing speaker.
The foregoing process repeats for a next frame of the mixed speech signal. Consequently, similarity weighting unit 640 generates a different bias vector 650 and similarity weighting unit 645 generates a different bias vector 655 corresponding to the vector 620 generated based on the next frame. Bias vectors 650 and 655 are therefore both time-varying and context-dependent.
Prior to process 700, a mixed speech signal is identified from which a target speaker's speech is to be extracted. The target speaker and a competing speaker whose speech is also present in the mixed speech signal are also identified.
A first plurality of multi-dimensional vectors are determined at S710, with each of the first plurality of multi-dimensional vectors representing a respective frame of a target speaker's speech. According to some embodiments, each of the first plurality of multi-dimensional vectors may represent a different portion of any number of recorded sentences of the target speaker's speech. A second plurality of multi-dimensional vectors are determined at S720, with each of the second plurality of multi-dimensional vectors representing a respective frame of the competing speaker's speech. According to some embodiments, each of the second plurality of multi-dimensional vectors may represent a different portion of any number of recorded sentences of the competing speaker's speech. S710 and S720 may be performed during a pre-processing phase in which a plurality of multi-dimensional vectors are determined for each of many speakers.
A multi-dimensional vector representing a frame of a mixed speech signal is determined at S730. The speech signal includes speech of the target speaker, the competing speaker, and zero or more additional speakers. Next, at S740, a similarity is determined between the vector determined at S730 and each of the first plurality of vectors determined at S710. A first weighted vector is determined at S750 based on the first plurality of vectors determined at S710 and on the similarities determined for each of these vectors at S740, as described with respect to S240 and
As depicted in
At S780, speech of the target speaker is extracted from the frame of speech of the two or more speakers based on the first weighted vector and the second weighted vector. According to some embodiments, and as illustrated in
Flow returns to S730 if it is determined at S790 that a next frame of speech of the two or more speakers is to be processed. Re-execution of these steps results in the determination of new similarities at S740 and S760 and resulting new first and second weighted vectors. These new first and second weighted vectors are then used to extract a frame of the target speaker's speech from the next frame of mixed speech.
A frame of a mixed speech signal is input to magnitude spectrum unit 805 to generate a log-scaled magnitude spectrum representing the frame of the mixed speech signal. The mixed speech signal may include speech of a known target speaker and a known competing speaker. Frames of the target speaker's profile are input to magnitude spectrum unit 810 to generate a log-scaled magnitude spectrum for each frame of the target speaker's profile, and frames of the competing speaker's profile are input to magnitude spectrum unit 815 to generate a log-scaled magnitude spectrum for each frame of the competing speaker's profile. As mentioned above, the log-scaled magnitude spectrum of a frame may consist of 257 dimensions.
The log-scaled magnitude spectra of the frame of mixed speech, the frames of the target speaker's profile and the frames of the competing speaker's profile are input to embedders 820, 825 and 830, respectively. Embedder 820 generates an embedding vector representing the mixed speech signal, embedder 825 generates embedding vectors of the target speaker's profile, and embedder 830 generates embedding vectors of the competing speaker's profile.
Attention network 835 generates a target bias vector based on a similarity between the embedding vector generated by embedder 820 and each of the embedding vectors generated by embedder 825. Attention network 835 may generate the bias vector by combining the embedding vectors generated by embedder 825 in a manner which gives more weight to the vectors of the target speaker's frames that are more similar to the current mixed speech frame.
Similarly, attention network 840 generates a competing bias vector based on a similarity between the embedding vector generated by embedder 820 and each of the embedding vectors generated by embedder 830. Attention network 840 may also generate the bias vector by combining the embedding vectors generated by embedder 830 in a manner which gives more weight to the vectors of the competing speaker's frames that are more similar to the current mixed speech frame.
Mask extraction unit 845 receives weighted vectors from attention networks 835 and 840, the embedding vector generated by embedder 820, and the log-scaled magnitude spectrum of the mixed speech frame. The received vectors may be concatenated in some embodiments. Mask extraction unit 845 generates a TF mask of the target speaker's speech and, in some embodiments and based on the training thereof, may also generate a TF mask of the competing speaker's speech. Generation unit 850 then applies the TF mask to the subject frame of the mixed speech signal to output an extracted frame of the target speaker's speech.
Model training platform 510 may operate to input training data to system 800 of
Profile utterances of each of the two randomly-selected speakers are also randomly sampled. The profile utterances (e.g., (Spkr1, Smpl2), (Spkr2, Smpl2)) are longer than a minimum length (e.g., 10 s) and are different from the utterances used for generating the mixed speech. As shown in
For each input training set, loss component 930 determines a difference between the output of system 800 and the utterance of the target speaker which was used to generate the mixed speech sample (e.g., (Spkr1, Smpl1)). The above is repeated for each training sample of the training set, in order to determine an overall loss.
In some embodiments, the same embedding network is used as embedder 820, embedder 825 and embedder 830 to generate the embedding vectors for the mixed speech signal, the target speaker profile and the competing speaker profile, respectively. The input dimension of mask extraction unit 845 may therefore be 257+512*3, and the output comprises TF masks of 257 dimensions.
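The 257+512*3 = 1793 input dimension follows from adding a second bias vector for the competing speaker. The sketch below computes a target bias and a competing bias for one frame and concatenates them with the spectrum and the frame embedding. The function name `attention_bias`, the inner-product similarity with softmax weighting, the concatenation order, the profile lengths and the random stand-in data are all assumptions for illustration.

```python
import numpy as np

def attention_bias(y, profile):
    """y: (D,) mixed-frame embedding; profile: (N, D) profile embeddings."""
    d = profile @ y                  # similarity of the frame to each profile frame
    w = np.exp(d - d.max())
    w /= w.sum()                     # softmax weights over the profile frames
    return w @ profile               # weighted sum of profile embeddings, shape (D,)

rng = np.random.default_rng(0)
D = 512
log_spectrum = rng.standard_normal(257)                   # stand-in spectrum
y = rng.standard_normal(D) * 0.1                          # stand-in frame embedding
target_profile = rng.standard_normal((20, D)) * 0.1       # stand-in target embeddings
competing_profile = rng.standard_normal((30, D)) * 0.1    # stand-in competing embeddings

target_bias = attention_bias(y, target_profile)
competing_bias = attention_bias(y, competing_profile)

# Assumed per-frame input layout for the extraction network: 257 + 3 * 512 values
extractor_input = np.concatenate([log_spectrum, y, target_bias, competing_bias])
```

The two profiles may have different numbers of frames; each bias vector still has the embedding dimension because it is a weighted sum over its own profile.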
According to some embodiments, only a competing speaker's profile is used to extract speech of a target speaker. For example, a bias vector is generated for each frame of input mixed speech based on vectors representing a competing speaker's profile, and this bias vector is used to extract the target speaker's speech. Such an embodiment may be implemented by a system similar to system 600, but omitting elements 630, 640 and 650, a process similar to process 700 but omitting steps S710, S740 and S750, or a system similar to system 800 but omitting elements 810, 825 and 835.
As shown, transcription service 1010 may be implemented as a cloud service providing transcription of mixed speech audio signals received over cloud 1020. Transcription service 1010 may implement target speaker speech extraction according to some embodiments.
Each of client devices 1030 and 1032 may be operated by a participant in a teleconference managed by cloud teleconference service 1040. Teleconference service 1040 may provide mixed speech audio signals of the meeting and, in some embodiments, a list of meeting participants, to transcription service 1010. Transcription service 1010 may operate according to some embodiments to extract speech of target speakers from the mixed speech signal and perform voice recognition on the extracted signals to generate a transcript of the teleconference. Transcription service 1010 may in turn access transcript storage service 1050 to store the generated transcript. Either of client devices 1030 and 1032 may then access transcript storage service 1050 to request a stored transcript.
System 1100 includes processing unit 1110 operatively coupled to communication device 1120, persistent data storage system 1130, one or more input devices 1140, one or more output devices 1150 and volatile memory 1160. Processing unit 1110 may comprise one or more processors, processing cores, etc. for executing program code. Communication device 1120 may facilitate communication with external devices, such as client devices, and data providers as described herein. Input device(s) 1140 may comprise, for example, a keyboard, a keypad, a mouse or other pointing device, a microphone, a touch screen, and/or an eye-tracking device. Output device(s) 1150 may comprise, for example, a display (e.g., a display screen), a speaker, and/or a printer.
Data storage system 1130 may comprise any number of appropriate persistent storage devices, including combinations of magnetic storage devices (e.g., magnetic tape and hard disk drives), flash memory, optical storage devices, Read Only Memory (ROM) devices, etc. Memory 1160 may comprise Random Access Memory (RAM), Storage Class Memory (SCM) or any other fast-access memory.
Transcription service 1132 may comprise program code executed by processing unit 1110 to cause system 1100 to receive mixed speech signals and extract the speech signals of one or more target speakers therefrom as described herein. Node operator libraries 1134 may comprise program code to execute functions of trained nodes of a neural network to generate TF masks as described herein. Speaker profiles 1136 may include utterances and/or embedding vectors representing profiles of one or more speakers. Data storage system 1130 may also store data and other program code for providing additional functionality and/or which are necessary for operation of system 1100, such as device drivers, operating system files, etc.
Each functional component described herein may be implemented at least in part in computer hardware, in program code and/or in one or more computing systems executing such program code as is known in the art. Such a computing system may include one or more processing units which execute processor-executable program code stored in a memory system.
The foregoing diagrams represent logical architectures for describing processes according to some embodiments, and actual implementations may include more or different components arranged in other manners. Other topologies may be used in conjunction with other embodiments. Moreover, each component or device described herein may be implemented by any number of devices in communication via any number of other public and/or private networks. Two or more of such computing devices may be located remote from one another and may communicate with one another via any known manner of network(s) and/or a dedicated connection. Each component or device may comprise any number of hardware and/or software elements suitable to provide the functions described herein as well as any other functions. For example, any computing device used in an implementation of a system according to some embodiments may include a processor to execute program code such that the computing device operates as described herein.
All systems and processes discussed herein may be embodied in program code stored on one or more non-transitory computer-readable media. Such media may include, for example, a hard disk, a DVD-ROM, a Flash drive, magnetic tape, and solid state Random Access Memory (RAM) or Read Only Memory (ROM) storage units. Embodiments are therefore not limited to any specific combination of hardware and software.
Those in the art will appreciate that various adaptations and modifications of the above-described embodiments can be configured without departing from the claims. Therefore, it is to be understood that the claims may be practiced other than as specifically described herein.
The present application claims the benefit of U.S. Provisional Patent Application No. 62/834,561 filed Apr. 16, 2019, the entire contents of which are incorporated herein by reference for all purposes.
Number | Date | Country
---|---|---
62834561 | Apr 2019 | US