The invention is in the field of audio processing devices and, more specifically, relates to an apparatus for morphing a human voice, of which some embodiments relate to training a voice morphing apparatus and some embodiments are used in the field of speech processing.
Recent advances in computing have raised the possibility of realizing many long sought-after voice-control applications. For example, improvements in statistical models, including practical frameworks for effective neural network architectures, have greatly increased the accuracy and reliability of previous speech processing systems. This has been coupled with a rise in wide area computer networks, which offer a range of modular services that can be simply accessed using application programming interfaces. Voice is quickly becoming a viable option for providing a user interface.
However, voice has a disadvantage when compared to text or other graphical input methods, namely that it is often easy to identify a particular speaker from captured speech. In many cases, it may be desired to use voice as an input interface but avoid a possibility of identifying the speaker. For example, a user may wish to make a voice enquiry without being identified and/or tracked. As a comparison, web browsers provide a private browsing or “incognito” mode that limits an amount of personal information that is exchanged with Internet servers. It would be useful to allow a similar mode for voice input. Voice anonymity may also be useful for allowing the exchange of voice data to train large linguistic neural network models. Supervised learning models often require labeled data, which involves manually labeling voice samples. It would be advantageous to anonymize voice data before it is sent to labelers.
Fahimeh Bahmaninezhad et al. in the paper “Convolutional Neural Network Based Speaker De-Identification” presented at Odyssey 2018, The Speaker and Language Recognition Workshop in Les Sables d'Olonne, France (the contents of which are incorporated herein by reference), describe a method of concealing speaker identity in speech signals. The proposed speaker de-identification system maps a voice of a given speaker to an average (or gender-dependent average) voice. The mapping is modeled by a new convolutional neural network (CNN) encoder-decoder architecture. The method is tested on the voice conversion challenge 2016 (VCC-2016) database.
Providing speaker de-identification and voice anonymity is difficult. Many existing systems seek to map a source speaker onto a target speaker, or an average of target speakers. However, it is easy for trained neural network systems to produce unintelligible or heavily distorted outputs that destroy the information carried in the voice signal. Additionally, comparative systems such as that proposed by Bahmaninezhad et al. map distinctive characteristics of input speech to different but still distinctive characteristics in the output speech, allowing some form of identification. It is also difficult to de-identify a speaker yet maintain non-identifying characteristics of speech audio such as noise, gender and accent.
Therefore, what is needed are systems and methods for voice modification to allow for user anonymity and privacy.
Aspects and embodiments of the invention are set out in the independent claim(s).
In accordance with various aspects and embodiments of the present invention, there is provided a method of training a voice morphing apparatus. The method includes evaluating an objective function for a plurality of data samples, each data sample including an input for the voice morphing apparatus, the objective function being defined as a function of at least an output of the voice morphing apparatus, the objective function including: a first term based on speaker identification, the first term modifying the objective function proportional to a measure of speaker identification based on at least the output of the voice morphing apparatus; and a second term based on audio fidelity of at least the output of the voice morphing apparatus, the second term modifying the objective function proportional to a measure of audio fidelity between the output and the input of the voice morphing apparatus. The method further includes adjusting parameters of the voice morphing apparatus based on the evaluating.
In accordance with some aspects and embodiments of the invention, by training a voice morphing apparatus using input audio data, e.g. unlabeled voice samples, and terms that are in opposition, a certainty of speaker identification may be reduced, effectively masking a speaker's identity while maintaining an audio fidelity, e.g. maintaining audio data that sounds like speech and may be processed by conventional speech processing systems. The objective function may include a loss function, in which case the first term may increase the loss based on a certainty or confidence of speaker identification and the second term may decrease the loss based on a similarity of the input and output.
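The two opposing terms may be sketched as follows. This is a minimal illustration, not the claimed implementation: `speaker_id_confidence` and `audio_fidelity` are assumed stand-ins for outputs of a speaker identification system and an audio fidelity system, each normalized to [0, 1].

```python
# Illustrative sketch of an objective (loss) function with two opposing
# terms. The first term increases the loss with speaker-identification
# certainty; the second term decreases the loss as the similarity between
# the input and the output of the voice morphing apparatus rises.
def morphing_loss(speaker_id_confidence: float,
                  audio_fidelity: float,
                  w_id: float = 1.0,
                  w_fid: float = 1.0) -> float:
    return w_id * speaker_id_confidence - w_fid * audio_fidelity

# A morph that hides the speaker yet keeps fidelity scores well (low loss):
good = morphing_loss(speaker_id_confidence=0.1, audio_fidelity=0.9)
# A morph that leaves the speaker identifiable scores poorly (high loss):
bad = morphing_loss(speaker_id_confidence=0.9, audio_fidelity=0.9)
```

Minimizing such a loss trades off masking the speaker's identity against preserving the audio signal.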
In accordance with various aspects, the voice morphing apparatus includes an artificial neural network architecture and adjusting parameters of the voice morphing apparatus includes applying a gradient descent method to a derivative of the objective function with respect to the parameters of the artificial neural network architecture. These aspects may thus be implemented using standardized neural network software libraries that provide for custom loss functions.
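As a toy illustration of the gradient-based adjustment, the sketch below uses a single scalar morphing parameter and a central-difference derivative standing in for a library's automatic differentiation; the concrete functional forms of the two terms are assumptions made for the example only.

```python
import numpy as np

# Toy morpher y = w * x with one trainable parameter w. The objective has
# an identification term that is high when w ~ 1 (voice unchanged, speaker
# identifiable) and a fidelity penalty that grows as w moves away from 1.
def objective(w: float) -> float:
    id_term = np.exp(-4.0 * (w - 1.0) ** 2)   # stand-in speaker-ID confidence
    fidelity_penalty = 0.1 * (w - 1.0) ** 2   # stand-in fidelity degradation
    return float(id_term + fidelity_penalty)

def gradient(w: float, eps: float = 1e-6) -> float:
    # Central difference; a neural-network library would provide this
    # derivative via automatic differentiation instead.
    return (objective(w + eps) - objective(w - eps)) / (2.0 * eps)

w = 0.9                       # start close to the identity mapping
lr = 0.05
history = [objective(w)]
for _ in range(200):
    w -= lr * gradient(w)     # gradient descent on the objective
    history.append(objective(w))
```

The parameter settles where the two terms balance: the morph is strong enough to defeat identification but not so strong that the fidelity penalty dominates.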
In accordance with various aspects, the second term is computed using an output of an audio processing component of an automatic speech recognition system. The audio processing component may be used to compute a speaker intelligibility measure for the second term, e.g. by computing a first phoneme recognition score for the input to the voice morphing apparatus using the audio processing component; computing a second phoneme recognition score for the output from the voice morphing apparatus using the audio processing component; and computing the second term for the objective function based on a comparison between the first and second phoneme recognition scores. Re-using existing components of an automatic speech recognition system may allow for easy implementation and also ensures that the voice morphing apparatus is trained consistently with speech processing functions that may be applied to an output of the apparatus. In this case, it may be ensured that the voice morphing apparatus does not overly degrade the accuracy of acoustic models that may be applied to morphed voices.
In accordance with various aspects, the method comprises comparing a spectrogram for the input to the voice morphing apparatus and a spectrogram for the output of the voice morphing apparatus; and computing the second term for the objective function based on the comparison. This may ensure that audio features are suitably conserved despite the voice being morphed, e.g. such that the audio still sounds “voice-like” and maintains similar-sounding transient and constant noise.
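A spectrogram comparison of this kind may be sketched as follows; the window and hop sizes, and the L1 distance, are illustrative assumptions.

```python
import numpy as np

# Magnitude spectrogram via windowed frames and a real FFT.
def magnitude_spectrogram(signal, frame_len=256, hop=128):
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def spectrogram_distance(audio_in, audio_out) -> float:
    # Mean L1 distance between spectrograms; 0 means identical audio.
    s_in = magnitude_spectrogram(audio_in)
    s_out = magnitude_spectrogram(audio_out)
    return float(np.mean(np.abs(s_in - s_out)))

t = np.linspace(0.0, 1.0, 8000, endpoint=False)
voice_like = np.sin(2 * np.pi * 220.0 * t)   # toy "voice" tone
shifted = np.sin(2 * np.pi * 233.0 * t)      # mildly morphed tone
noise = np.random.default_rng(0).standard_normal(8000)
```

A mild pitch shift stays spectrally close to the input, while replacing the voice with noise does not; using this distance as the second term therefore discourages outputs that are no longer "voice-like".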
In accordance with various aspects, the first term is computed using an output of a speaker identification component of an automatic speech recognition system. The first term is based on a certainty score output by the speaker identification component. In certain cases, the first term may be computed by computing a first speaker identification vector for the input to the voice morphing apparatus using the speaker identification component; computing a second speaker identification vector for the output from the voice morphing apparatus using the speaker identification component; and comparing the first and second speaker identification vectors. Again, using existing speech processing components reduces the implementational complexity. Comparing an output of parallel speaker identification processes may provide one way of measuring a change in speaker identification ability.
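Comparing the two speaker identification vectors may be sketched as below; the embedding vectors stand in for outputs (e.g. x-vector-like embeddings) of a hypothetical speaker identification component.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# High similarity between the input and output embeddings means the
# speaker is still recognizable, so the first term should contribute a
# large value that training then pushes down.
def speaker_id_term(embedding_in: np.ndarray, embedding_out: np.ndarray) -> float:
    return cosine_similarity(embedding_in, embedding_out)

rng = np.random.default_rng(7)
v_in = rng.standard_normal(512)                  # embedding of the input voice
v_same = v_in + 0.01 * rng.standard_normal(512)  # barely-morphed voice
v_morphed = rng.standard_normal(512)             # heavily morphed voice
```

A barely-morphed voice yields a similarity near 1, while a thorough morph yields a similarity near 0, giving the objective a usable measure of the change in speaker identification ability.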
In accordance with various aspects, the objective function comprises one or more further terms based on one or more of: a gender classification using at least the output of the voice morphing apparatus; and an accent classification using at least the output of the voice morphing apparatus, wherein the one or more further terms are weighted to either maintain or move away from one or more of a gender classification and an accent classification. In one aspect, one or more classifiers may be used to determine one or more further terms that allow for certain characteristics of a voice to be maintained despite a masking of the speaker identity. For example, applying gender and accent classifiers may allow for gender and accent to be maintained. In certain aspects the one or more further terms are based on a comparative score between a classification applied to the input of the voice morphing apparatus and a classification applied to the output of the voice morphing apparatus and input data is pre-selected to provide a defined distribution of voice characteristics.
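One possible form for such a further term is sketched below, assuming a hypothetical classifier that outputs a distribution over classes: the sign of the weight decides whether the classification is maintained or pushed away from.

```python
# Compare a (hypothetical) gender or accent classifier's output
# distribution for the input and for the output of the morphing apparatus.
# A positive weight penalizes changing the classification across the
# morph; a negative weight would instead reward changing it.
def classifier_consistency_term(p_in, p_out, weight=1.0):
    # Total-variation distance between the two class distributions.
    distance = 0.5 * sum(abs(a - b) for a, b in zip(p_in, p_out))
    return weight * distance

# e.g. an accent classifier over three illustrative accent classes:
kept = classifier_consistency_term([0.8, 0.1, 0.1], [0.75, 0.15, 0.1])
lost = classifier_consistency_term([0.8, 0.1, 0.1], [0.1, 0.1, 0.8])
```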
In accordance with one aspect, there is provided a system for training a voice morphing apparatus, the system comprising a voice morphing apparatus configured to evaluate an objective function for a plurality of data samples, each data sample comprising an input for the voice morphing apparatus, the objective function being defined as a function of at least an output of the voice morphing apparatus. The objective function comprises a first term based on speaker identification, the first term modifying the objective function proportional to a measure of speaker identification based on at least the output of the voice morphing apparatus and a second term based on audio fidelity of at least the output of the voice morphing apparatus, the second term modifying the objective function proportional to a measure of audio fidelity between the output and the input of the voice morphing apparatus. The system is further configured to adjust the parameters based on the evaluating.
The voice morphing apparatus may comprise an artificial neural network architecture. The system (for example an objective function evaluator) may adjust the parameters by applying a gradient descent method to a derivative of the objective function with respect to the parameters of the artificial neural network architecture.
The system may further comprise an automatic speech recognition system comprising an audio processing component. The system may compute the second term using an output of the audio processing component. The system may compute a speaker intelligibility measure for the second term using the audio processing component.
The audio processing component may compute a first phoneme recognition score for the input to the voice morphing apparatus and a second phoneme recognition score for the output from the voice morphing apparatus. The system may compute the second term for the objective function based on a comparison between the first and second phoneme recognition scores.
The system may compare a spectrogram for the input to the voice morphing apparatus and a spectrogram for the output of the voice morphing apparatus and compute the second term for the objective function based on the comparison.
The system may comprise a speaker identification component. The system may compute the first term using an output of a speaker identification component. The speaker identification component may output a certainty score. The first term may be based on the certainty score output by the speaker identification component.
The speaker identification component may be used to compute a first speaker identification vector for the input to the voice morphing apparatus. The speaker identification component may be used to compute a second speaker identification vector for the output from the voice morphing apparatus. The system may compute the first term for the objective function based on a comparison between the first and second speaker identification vectors.
The voice morphing apparatus may be configured to evaluate the objective function comprising one or more further terms based on one or more of a gender classification using at least the output of the voice morphing apparatus and an accent classification using at least the output of the voice morphing apparatus, wherein the one or more further terms are weighted to either maintain or move away from one or more of a gender classification and an accent classification.
The system may apply a classification to the input of the voice morphing apparatus. The system may apply a classification to the output of the voice morphing apparatus. The one or more further terms may be based on a comparative score between the classification applied to the input of the voice morphing apparatus and the classification applied to the output of the voice morphing apparatus.
The system may pre-select input data to provide a defined distribution of voice characteristics.
In accordance with another aspect, a system for training a voice morphing apparatus is provided. The system comprises a voice morphing apparatus comprising a set of trainable parameters, the voice morphing apparatus being configured to map input audio data to output audio data; a speaker identification system configured to output speaker identification data based on input audio data; and an audio fidelity system configured to output audio fidelity data. The system is configured to pass at least output audio data for the voice morphing apparatus to the speaker identification system and the audio fidelity system, wherein the system is configured to train the voice morphing apparatus using at least a set of input audio data, and wherein an output of the speaker identification system and an output of the audio fidelity system are used by the system to adjust the set of trainable parameters.
This system may provide benefits similar to the above-mentioned method. The voice morphing apparatus may comprise an artificial neural network architecture.
In accordance with various aspects, the speaker identification system is configured to output a score indicative of a confidence of identification for one or more speakers, and wherein the system is configured to evaluate an objective function with a first term based on the score indicative of a confidence of identification, the objective function causing the system to adjust the set of trainable parameters to reduce the score. The speaker identification system may comprise a speaker identification component and the system may be configured to train the voice morphing apparatus to maximize a difference between outputs of the speaker identification component for the input audio data and the output audio data of the voice morphing apparatus. Speaker identification systems may be configured to output confidence or probability data as part of a prediction; this data may thus be re-used to train the voice morphing apparatus.
In accordance with various aspects, the audio fidelity system comprises a speaker intelligibility component, the speaker intelligibility component comprising a speech processing component. The speaker intelligibility component may comprise a phoneme recognition component and the audio fidelity system may be configured to output a measure of similarity based on a difference between outputs of the phoneme recognition component for the input audio data and the output audio data of the voice morphing apparatus, wherein the system is configured to train the voice morphing apparatus to minimize said difference. In this case, existing front-end components of an automatic speech recognition system may be re-purposed to train the voice morphing apparatus to maintain an intelligibility of morphed speech. The audio fidelity system may further comprise an audio similarity component configured to compare the input audio data and the output audio data of the voice morphing apparatus, wherein the audio fidelity system may be configured to output a measure of similarity based on an output of the audio similarity component, the system being configured to train the voice morphing apparatus to maximize an output of the audio similarity component for the input audio data and the output audio data. The audio similarity component may be configured to generate a score indicative of a spectrogram similarity. This may help train the voice morphing apparatus to morph speech in a manner that retains speech or voice-like audio characteristics, despite a masking of the speaker identity.
In accordance with various aspects, the system comprises one or more voice feature classifiers, wherein the system is configured to apply the one or more voice feature classifiers to at least the output audio data for the voice morphing apparatus and to use an output of the one or more voice feature classifiers to adjust the set of trainable parameters for the voice morphing apparatus. These voice feature classifiers may be used as part of an objective or loss function for the training of the voice morphing apparatus to retain or discard (depending on configuration) certain aspects of speech such as gender or accent. The system may be configured to compare outputs of the one or more voice feature classifiers for the input audio data and the output audio data of the voice morphing apparatus and to use an output of the comparison to adjust the set of trainable parameters for the voice morphing apparatus.
In accordance with another aspect, a method of training a voice morphing apparatus is provided. The method comprises mapping, by a voice morphing apparatus comprising a set of trainable parameters, input audio data to output audio data, outputting, by a speaker identification system, speaker identification data based on input audio data, outputting, by an audio fidelity system, audio fidelity data, passing at least output audio data for the voice morphing apparatus to the speaker identification system and the audio fidelity system, training the voice morphing apparatus using at least a set of input audio data, and using an output of the speaker identification system and an output of the audio fidelity system to adjust the set of trainable parameters.
The method may comprise outputting a score indicative of a confidence of identification for one or more speakers, and evaluating an objective function with a first term based on the score indicative of a confidence of identification, and adjusting, using the objective function, the set of trainable parameters to reduce the score.
The speaker identification system may comprise a speaker identification component. The method may comprise training the voice morphing apparatus to maximize a difference between outputs of the speaker identification component for the input audio data and the output audio data of the voice morphing apparatus.
The audio fidelity system may comprise a speaker intelligibility component, the speaker intelligibility component may comprise a speech processing component. The speaker intelligibility component may comprise a phoneme recognition component.
The method may further comprise outputting, by the audio fidelity system, a measure of similarity based on a difference between outputs of the phoneme recognition component for the input audio data and the output audio data of the voice morphing apparatus, and training the voice morphing apparatus to minimize said difference.
The audio fidelity system may comprise an audio similarity component. The method may further comprise comparing, by the audio similarity component, the input audio data and the output audio data of the voice morphing apparatus, outputting, by the audio fidelity system, a measure of similarity based on an output of the audio similarity component and training the voice morphing apparatus to maximize an output of the audio similarity component for the input audio data and the output audio data.
The method may further comprise generating, by the audio similarity component, a score indicative of a spectrogram similarity.
The method may further comprise applying one or more voice feature classifiers to at least the output audio data for the voice morphing apparatus and using an output of the one or more voice feature classifiers to adjust the set of trainable parameters for the voice morphing apparatus. The method may further comprise comparing outputs of the one or more voice feature classifiers for the input audio data and the output audio data of the voice morphing apparatus and using an output of the comparison to adjust the set of trainable parameters for the voice morphing apparatus.
In accordance with another aspect, a voice morphing apparatus is provided. The voice morphing apparatus may comprise a neural network architecture to map input audio data to output audio data, the input audio data comprising a representation of speech from a speaker, the neural network architecture comprising a set of parameters, the set of parameters being trained to reduce a speaker identification score from the input audio data to the output audio data and to optimize a speaker intelligibility score for the output audio data.
The voice morphing apparatus of this aspect may be used to morph speech in a manner that hides or masks a speaker identity. This may be useful for anonymizing speech data and/or for providing private voice queries.
In accordance with various aspects, the voice morphing apparatus may comprise a noise filter to pre-process the input audio data, wherein the noise filter is configured to remove a noise component from the input audio data and the voice morphing apparatus is configured to add the noise component to output audio data from the neural network architecture. This may enable noise to be isolated from the system to increase a stability of training and/or preserve noise features of the audio data for use as a subsequent speech data training set.
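The noise-isolation idea may be sketched as follows. The moving-average "denoiser" and the sign-flipping "morph" are toy stand-ins for real components; the point illustrated is only that the extracted noise component bypasses the morphing and is re-attached to the output.

```python
import numpy as np

# Crude stand-in noise filter: a moving-average smoother.
def smooth(x, k=5):
    return np.convolve(x, np.ones(k) / k, mode="same")

def morph_with_noise_passthrough(audio: np.ndarray) -> np.ndarray:
    clean = smooth(audio)            # "denoised" voice component
    noise = audio - clean            # isolated noise component
    morphed_clean = -clean           # trivial stand-in "morph" of the voice
    return morphed_clean + noise     # re-attach the original noise

rng = np.random.default_rng(1)
x = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.05 * rng.standard_normal(1000)
y = morph_with_noise_passthrough(x)
residual_noise = y - (-smooth(x))    # noise carried through unchanged
```

Because the noise never passes through the morphing network, its statistics are preserved in the output, and the network is trained only on the comparatively stable "clean" component.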
In accordance with various aspects, the neural network architecture comprises one or more recurrent connections. For example, an output of the neural network architecture may be fed back as an input for future outputs, e.g. may form part of an input for a later time step.
In certain aspects, the voice morphing apparatus may be configured to output time-series audio waveform data based on the output audio data from the neural network architecture. In one case, the voice morphing apparatus may directly output time series audio data; in another case, the voice morphing apparatus may output spectrogram data that may be converted to time series audio data.
In an aspect, a method for using a voice morphing apparatus is provided. The method comprises mapping, via a neural network architecture, input audio data to output audio data, the input audio data comprising a representation of speech from a speaker, the neural network architecture comprising a set of parameters and training the set of parameters to reduce a speaker identification score from the input audio data to the output audio data and to optimize a speaker intelligibility score for the output audio data.
The method may further comprise pre-processing the input audio data with a noise filter. The method may further comprise removing, by the noise filter, a noise component from the input audio data and adding, by the voice morphing apparatus, the noise component to output audio data from the neural network architecture. The neural network architecture may comprise one or more recurrent connections.
The method may further comprise outputting, by the voice morphing apparatus, time-series audio waveform data based on the output audio data from the neural network architecture.
According to another aspect, a non-transitory computer-readable storage medium may be provided that stores instructions which, when executed by at least one processor, cause the at least one processor to: load input audio data from a data source; input the input audio data to a voice morphing apparatus, the voice morphing apparatus comprising a set of trainable parameters; process the input audio data using the voice morphing apparatus to generate morphed audio data; apply a speaker identification system to at least the morphed audio data to output a measure of speaker identification; apply an audio fidelity system to the morphed audio data and the input audio data to output a measure of audio fidelity; evaluate an objective function based on the measure of speaker identification and the measure of audio fidelity; and adjust the set of trainable parameters for the voice morphing apparatus based on a gradient of the objective function, wherein the objective function is configured to adjust the set of trainable parameters to optimize the measure of audio fidelity between the morphed audio data and the input audio data and to modify the measure of speaker identification.
According to another aspect, there is provided a method for training a voice morphing apparatus. The method comprises loading input audio data from a data source, inputting the input audio data to the voice morphing apparatus, the voice morphing apparatus comprising a set of trainable parameters, processing the input audio data using the voice morphing apparatus to generate morphed audio data, applying a speaker identification system to at least the morphed audio data to output a measure of speaker identification, applying an audio fidelity system to the morphed audio data and the input audio data to output a measure of audio fidelity, evaluating an objective function based on the measure of speaker identification and the measure of audio fidelity; and adjusting the set of trainable parameters for the voice morphing apparatus based on a gradient of the objective function, wherein the objective function is configured to adjust the set of trainable parameters to optimize the measure of audio fidelity between the morphed audio data and the input audio data and to modify the measure of speaker identification.
The following describes various embodiments of the present technology that illustrate various interesting aspects. Generally, embodiments can use the described aspects in any combination. All statements herein reciting principles, aspects, and embodiments are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
It is noted that, as used herein, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise. Reference throughout this specification to “one,” “an,” “certain,” “various,” and “cases”, “embodiments” or similar language means that a particular aspect, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases “in one case,” “in at least one embodiment,” “in an embodiment,” “in certain cases,” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment or similar embodiments. Furthermore, aspects and embodiments described herein are merely by way of example, and should not be construed as limiting of the scope or spirit of the invention as appreciated by those of ordinary skill in the art. The invention is effectively made or used in any embodiment that includes any novel aspect described herein. Furthermore, to the extent that the terms “including”, “includes”, “having”, “has”, “with”, or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a similar manner to the term “comprising.” In embodiments showing multiple similar elements, such as storage devices, even if using separate reference numerals, some such embodiments may work with a single element filling the role of the multiple similar elements.
Certain embodiments described herein relate to training a voice morphing apparatus. A voice morphing apparatus comprises a device that takes input audio data and generates modified output audio data. The audio data may comprise raw waveforms, e.g. one or more channels of pressure or microphone membrane displacement measurements over time, and/or processed audio data, including frequency measurements and spectrograms. The voice morphing apparatus may operate upon a series of time steps to generate output audio data with a plurality of samples over time. In one case, the input audio data and the output audio data may have a common time base, e.g. a sample of output audio data is generated for every sample of input audio data. In certain cases, the voice morphing apparatus may be configured to generate an output waveform that may be played as a sound recording; in other cases, a further component may take output audio from the voice morphing apparatus, e.g. in the form of frequency or spectrogram samples, and generate an output waveform that may be rendered. The voice morphing apparatus may be applied online (e.g. to real-time speech capture) and/or offline (e.g. to batches of pre-recorded speech segments). In certain cases, the voice morphing apparatus may be configured to use the output audio data to replace the input audio data, e.g. modify an audio file in-place.
In embodiments described herein the voice morphing apparatus is configured to modify input audio data to morph a voice present in the audio data. Morphing a voice may comprise changing one or more aural characteristics of the voice. In embodiments described herein, the voice is morphed to hide an identity of a speaker, e.g. such that a particular voice audible in the output audio data is not distinguishable as the same voice audible in the input audio data. The audio data is processed by the voice morphing apparatus such that speech is minimally distorted by the morphing, e.g. such that a person and/or an automatic speech recognition system may still successfully process the speech despite a morphed voice.
The speaker identification system 210 is configured to process at least the output audio data 130 to determine a measure of speaker identification. This measure of speaker identification may comprise one or more confidence values. In one case, the measure of speaker identification may comprise a probability indicating whether the speaker identification system 210 can successfully identify a speaker. For example, a value of 0.5 may indicate that the speaker identification system 210 has a confidence of 50% in an identification of a speaker featured in the output audio data 130. Or put another way, a value of 0.5 may indicate that a highest probability for a speaker classification (e.g. a maximum likelihood value) is 50%, e.g. the most likely speaker is speaker X who has a probability value of 50%. Different methods may be used to generate the measure of speaker identification as long as the measure is output within a predefined range (e.g. a normalized range of 0 to 1 or an 8-bit integer value between 0 and 255). The output of the speaker identification system 210 may comprise a normalized scalar value. In one case, the speaker identification system 210 may apply a hierarchical identification, e.g. perform a first identification to determine a set of speakers and then perform a second identification to determine a speaker within the determined set. In this case, the measure of speaker identification may comprise a probability from the second identification or an aggregate value (e.g. an average) across the set of hierarchical stages.
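One assumed way to produce such a normalized measure is sketched below: a softmax over per-speaker scores, with the maximum posterior (the maximum-likelihood speaker) used as the confidence value in the range 0 to 1.

```python
import numpy as np

# Normalize raw per-speaker scores (logits) into posteriors and take the
# maximum as the measure of speaker identification, e.g. 0.5 indicates the
# most likely speaker has a probability value of 50%.
def speaker_id_confidence(logits: np.ndarray) -> float:
    e = np.exp(logits - logits.max())   # shift for numerical stability
    posteriors = e / e.sum()
    return float(posteriors.max())

confident = speaker_id_confidence(np.array([8.0, 0.0, 0.0, 0.0]))  # clear match
uncertain = speaker_id_confidence(np.array([0.0, 0.0, 0.0, 0.0]))  # no preference
```

A well-trained morph should drive this measure down toward the uniform case (here 1/4 over four enrolled speakers), indicating that no single speaker stands out.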
The audio fidelity system 220, in the embodiment of
By applying the components and systems shown in
In certain embodiments, the training system 140 may be implemented using machine learning libraries such as TensorFlow or PyTorch. These libraries provide interfaces for defining neural network architectures and for performing training. These libraries allow for custom loss definitions and these may be used to implement the custom objective functions described herein. In these cases, a derivative of the objective function may be determined automatically using the methods of the libraries, e.g. by using the chain rule and automatic differentiation along a compute graph.
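A custom loss of the kind described may be wired into such a library roughly as follows (shown here with PyTorch). The tiny linear "morpher" and the stand-in terms are illustrative assumptions only; in practice, frozen pretrained speaker identification and fidelity systems would supply the terms.

```python
import torch

torch.manual_seed(0)
morpher = torch.nn.Linear(16, 16)     # toy stand-in for the morphing network
optimizer = torch.optim.SGD(morpher.parameters(), lr=0.1)

def custom_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Stand-in terms: identification confidence is high when the output is
    # close to the input; the fidelity penalty grows as they diverge.
    mse = torch.mean((y - x) ** 2)
    id_term = mse.neg().exp()          # exp(-mse): high when y ~ x
    fidelity_term = mse                # low when the audio is preserved
    return id_term + 0.1 * fidelity_term

x = torch.randn(8, 16)                # batch of toy audio features
losses = []
for _ in range(50):
    optimizer.zero_grad()
    y = morpher(x)
    loss = custom_loss(x, y)
    loss.backward()                   # automatic differentiation along the graph
    optimizer.step()
    losses.append(loss.item())
```

The library's autograd handles the derivative of the custom objective with respect to the morpher's parameters, so only the forward computation of the loss needs to be defined.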
In certain embodiments, one or more of the speaker identification system 210 and the audio fidelity system 220 may comprise existing components of an automatic speech recognition system.
The speaker identification system 210 may comprise a component or module in a speech processing pipeline that identifies a speaker. The speaker identification system 210 may comprise a Hidden Markov Model and/or Gaussian Mixture Model system for speaker identification or a neural network architecture for speaker identification, e.g. such as a system based on x-vectors as described in the paper by Snyder, David, et al. “X-vectors: Robust DNN embeddings for speaker recognition.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018 (the contents of which are incorporated herein by reference). In the case that the speaker identification system 210 comprises a neural network architecture, the parameters of the speaker identification system 210 may be fixed when training the voice morphing apparatus 110 (i.e. the parameters of the speaker identification system 210 are not trained when training the voice morphing apparatus 110).
The audio fidelity system 220 may also comprise one or more audio processing components or modules of an automatic speech recognition system. In one case, the audio fidelity system 220 may comprise a phoneme recognition system or acoustic model. This may again be a probabilistic model or a neural network architecture. In one case, the audio fidelity system 220 may comprise an acoustic model that receives at least the output audio data 130 and determines a confidence or probability vector for a set of available phones, phonemes and/or graphemes. Like the speaker identification system 210 described above, an output of the audio fidelity system 220 may comprise a function of this confidence or probability vector. However, unlike the output of the speaker identification system 210, in this case it is desired to maximize the values of the confidence or probability vector, e.g. to have a strong positive identification of linguistic features such as phonemes within the output audio data 130. As above, in the case that the audio fidelity system 220 comprises one or more neural network architectures, the parameters of the audio fidelity system 220 may be fixed when training the voice morphing apparatus 110 (i.e. the parameters of the audio fidelity system 220 are not trained when training the voice morphing apparatus 110). As the parameters of the two systems are fixed, they may be treated as constants in any automatic differentiation of the objective function.
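The fixed-parameter adversarial arrangement described above can be sketched with scalar stand-ins. Here PHI plays the role of the frozen speaker identification parameters and theta the trainable parameters of the voice morphing apparatus; all names and values are illustrative assumptions, not the systems themselves.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# PHI stands in for the fixed (frozen) parameters of the speaker
# identification system: treated as a constant and never updated.
PHI = 2.0

def sid_confidence(y):
    """Stand-in speaker identification confidence on morphed audio y."""
    return sigmoid(PHI * y)

theta = 1.0   # trainable parameter of the voice morphing apparatus
x = 1.0       # stand-in input audio feature
lr = 0.5
for _ in range(200):
    s = sid_confidence(theta * x)
    # Chain rule passes THROUGH the fixed PHI; only theta is updated.
    grad = s * (1.0 - s) * PHI * x
    theta -= lr * grad
```

After training, the frozen "identifier" parameter is unchanged while the morphing parameter has moved to defeat it.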
The present embodiments thus provide for a form of adversarial training of the voice morphing apparatus 110 using existing components of an automatic speech recognition system or related speech processing technologies. This makes the training system 140 easy to implement, as existing computer program code and/or hardware devices may be applied in a modular manner to build the training system 140 and output data for use in evaluating an objective function for the voice morphing apparatus 110. One or more of the speaker identification system 210 and the audio fidelity system 220 may comprise front-end components of an automatic speech recognition system, such that a full speech processing pipeline does not need to be applied to train the voice morphing apparatus 110.
Those skilled in the art will understand that there may be many different ways to construct an objective or loss function with comparative functionality. For example, the comparator 320 may output the speaker identification score SID as an inverse of a distance measure between speaker identification probability vectors, in which case a positive weight may be applied such that minimizing this term maximizes the distance. The scores may be determined per time sample or may be averaged over a plurality of time samples.
In one case, weights for each score may be predetermined, e.g. so as to give more importance to one or more of the scores. In one case, the scores and/or the weight may be normalized, e.g. such that the weights sum to one and the scores are a value between 0 and 1. In other cases, the weights may comprise parameters that are optimized as part of the training. In yet other cases, the weights may be dynamic and change based on the scores and/or other information associated with the input audio data 120.
In one case, different classifiers may be added or removed in a modular manner to configure the voice morphing apparatus 110 and/or to generate different instances of the voice morphing apparatus 110 that preserve or change different characteristics. In one case, for each feature that is to be changed (“flipped”), a term may be added to a loss function such that, when the loss function is minimized, the difference between a classifier for the feature applied to the input audio data and a classifier for the feature applied to the output audio data is maximized. For example, this may be achieved by using an inverse of the difference between the classifiers for the feature in the loss function.
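The modular "flip"/"keep" terms described above can be sketched as follows; the inverse-of-difference form for flipped features is the one named in the text, while the function names and epsilon value are illustrative assumptions.

```python
import numpy as np

def flip_term(c_in, c_out, eps=1e-6):
    """Loss term for a feature to be changed ("flipped"): minimizing the
    inverse of the classifier difference maximizes the difference between
    the classifier applied to the input audio and the classifier applied
    to the output audio."""
    diff = np.abs(np.asarray(c_in, dtype=float) - np.asarray(c_out, dtype=float)).mean()
    return 1.0 / (diff + eps)

def keep_term(c_in, c_out):
    """Loss term for a feature to be preserved: minimizing the plain
    difference keeps the classifier outputs close."""
    return np.abs(np.asarray(c_in, dtype=float) - np.asarray(c_out, dtype=float)).mean()
```

Adding or removing such terms per feature configures which characteristics an instance of the apparatus preserves or changes.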
In certain cases, the voice morphing apparatus described herein may be based on a so-called neural vocoder, i.e. a neural network architecture comprising encoder and decoder components. In certain cases, the neural network architectures may only implement a “vocoder decoder” part of a traditional vocoder, e.g. that maps processed audio features into output audio data that may comprise a time-series waveform. When comparing with a traditional vocoder, the “vocoder encoder” part of the neural vocoder may not need to be implemented using a neural network architecture, but instead may be implemented using conventional audio signal processing operations (e.g. the Fast Fourier Transform—FFT—and/or filter banks, taking the magnitude and/or logarithm). In this case, the “vocoder encoder” part of the neural vocoder may not be “neural” but may comprise the audio pre-processing operations described herein. Only the “vocoder decoder” portion of these architectures may comprise a neural network architecture with a set of trainable parameters.
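The conventional audio signal processing operations mentioned above (FFT, filter banks, taking the magnitude and logarithm) can be sketched for a single windowed frame as follows. The frame length, sample rate and number of mel bands are illustrative choices, not parameters of any described embodiment.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_frame(frame, sr=16000, n_mels=40):
    """Non-neural 'vocoder encoder' operations for one audio frame:
    window -> FFT -> magnitude -> triangular mel filter bank -> log."""
    n_fft = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fbank = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return np.log(fbank @ mag + 1e-8)
```

Features of this kind could then feed the trainable "vocoder decoder" portion of the architecture.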
It should also be noted that the neural network architecture may comprise a neural encoder-decoder (e.g. autoencoder-like) architecture as considered from the neural network perspective. This may or may not map onto the traditional encoder-decoder portions of a traditional (non-neural) vocoder. For example, a “vocoder decoder” portion of a vocoder may be implemented using a neural encoder-decoder architecture.
The neural vocoder may comprise one or more recurrent connections. These may not be needed in all embodiments, e.g. convolutional neural network architectures may alternatively use a plurality of frames of audio data including frames before a current frame and frames ahead of a current frame. These approaches may be able to use a sliding window so as to avoid slower recurrent connections (such as found within recurrent neural networks). In one case, the voice morphing apparatus is configured to receive time-series audio waveform data and output time-series audio waveform data; in other cases, the audio data may comprise frequency or Mel features as described. The neural vocoder may comprise one or more convolutional neural network layers and/or one or more feedforward neural network layers. Embodiments of suitable neural vocoder architectures that may be used as a basis for the voice morphing apparatus 110 include those described in “Efficient Neural Audio Synthesis” by Kalchbrenner et al. (published via arXiv on 25 Jun. 2018), “Waveglow: A Flow-Based Generative Network For Speech Synthesis” by Prenger et al. (published via arXiv on 31 Oct. 2018) and “Towards Achieving Robust Universal Neural Vocoding” by Lorenzo-Trueba et al. (published via arXiv on 4 Jul. 2019), all of which are incorporated herein by reference.
In certain embodiments, the plurality of input audio data 120 is pre-selected to provide a defined distribution of voice characteristics. For example, it may be beneficial to train the voice morphing apparatus described herein on a large data set of voice recordings that feature a diverse range of voices. It may also be recommended to use a large data set of diverse voice content, e.g. a plurality of different phrases as opposed to many different voices repeating a common phrase (such as a wake word).
In certain embodiments, a large range of training samples (e.g. for use as input audio data 120) may be generated or augmented using parametric speech synthesis. In this case, speech samples may be generated by selecting the parameters of the speech synthesis system. For example, a training set may be generated by creating random (or pseudo random) text segments and then using a text-to-speech system to convert the text to audio data. In this case, the parameters of the text-to-speech system may also be randomly sampled (e.g. random or pseudo random selections using inbuilt software library and/or hardware functions) to generate a diverse set of training samples. For example, to ensure diversity, an array of speech synthesis parameter sets can be learned that is able to create speech from text, where the speech has an even distribution of vectors matching a range defined by vectors computed from speech from a broad range of human voices within an embedding space.
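The random parameter sampling described above can be sketched as follows. The parameter names and the `synthesize` function are illustrative assumptions only: `synthesize` is a runnable stand-in (a sine tone shaped by the sampled parameters), not a real text-to-speech system's API.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_synthesis_params():
    """Pseudo-randomly sample stand-in text-to-speech parameters.
    The parameter names are illustrative assumptions."""
    return {
        "pitch_hz": rng.uniform(80.0, 300.0),
        "speaking_rate": rng.uniform(0.8, 1.3),
        "speaker_vec": rng.standard_normal(8),
    }

def synthesize(text, params, sr=16000):
    """Stand-in for a real text-to-speech system: emits a sine tone at
    the sampled pitch, with duration scaled by the speaking rate.
    Present only so the sketch runs end to end."""
    n = int(sr * 0.05 * len(text) / params["speaking_rate"])
    t = np.arange(n) / sr
    return np.sin(2 * np.pi * params["pitch_hz"] * t)

# Build a small batch of diverse training samples.
batch = [synthesize("hello world", sample_synthesis_params()) for _ in range(4)]
```

In practice each sampled parameter set would drive a real text-to-speech system over randomly generated text segments.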
In certain cases, a speaker identification system may itself be trained on a database of audio data from a plurality of different speakers. The speakers that are used to train the speaker identification system may affect the training of the voice morphing apparatus (e.g. when the parameters of the speaker identification system are fixed and are used to train the apparatus in an adversarial manner). For example, in one case, the training method described herein may act to modify the input audio data so as to change a distribution of features that are used for speaker identification, e.g. as may be present in one or more hidden or output layers of a neural speaker identification system.
Certain embodiments described herein differ from comparative approaches that attempt to map speaker features present in input audio data to either another target speaker or an average of a set of target speakers. These comparative approaches suffer from issues: rather than anonymizing a voice, they assign the voice to another speaker, which may lead to its own privacy issues. In certain embodiments described herein, however, the voice morphing apparatus is trained to repel speaker features present in the input audio from speakers known to the speaker identification system, effectively making it difficult to determine an identity as opposed to swapping an identity. This may be shown in the example chart 1130 of
In certain embodiments, to optimize the parameters of the voice morphing apparatus such that they de-identify a voice in a manner suitable for human listeners, it may be preferred that the speaker identification system is optimized such that a profile of its relative accuracy across training voices is as close as possible to a profile of human listeners' relative accuracy across the same voices. Hence, when trying to minimize a speaker identification certainty, the voice morphing apparatus will learn to modify the voice in the input audio data in a manner that minimizes the change in audio features but that maximizes confusion for human beings. It is preferred to have a large diverse set of voice characteristics such that the voice morphing apparatus may make minimal changes to the input audio data. For example, if the speaker identification system is trained using a plurality of people with a thick accent, it may learn to adjust the voice within the feature space of the thick accent but in a manner that results in a voice with a thick accent that is not identifiable.
In certain cases, it may be possible to train the voice morphing apparatus using audio data from a single speaker. In this case, a speaker identification system may be trained on many speakers (which may include the speaker). However, improved morphing characteristics may be present when the voice morphing apparatus is trained using audio data from multiple speakers that are distributed evenly in voice feature space. Multiple speakers may work to reduce noise and randomness (e.g. jumps in the gradient) when training and improve convergence. In one case, mini-batches may be used to average out differences across multiple speakers and/or normalization may be applied. One form of normalization may use speaker embeddings. For example, a training set may indicate a speaker identification (e.g. an ID number) that may be used to retrieve an embedding (i.e. a vector of values) that represents the speaker. The speaker embeddings may be trained with the whole system (and/or components of the system). If speaker embeddings are provided as an input during training, the voice morphing apparatus may be able to use this information to learn to normalize voices without averaging out specific information about different regions of voice feature space.
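The speaker-embedding conditioning described above may be sketched as a simple lookup table; the table here is randomly initialized and static, whereas in a full system the embedding vectors would be trained jointly with the rest of the system. Class and parameter names are illustrative assumptions.

```python
import numpy as np

class SpeakerEmbeddings:
    """Minimal speaker-embedding table: maps a speaker ID number to a
    vector of values representing that speaker."""
    def __init__(self, n_speakers, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((n_speakers, dim)) * 0.01

    def lookup(self, speaker_id):
        return self.table[speaker_id]

    def condition(self, features, speaker_id):
        # Concatenate the embedding onto each frame of input features so
        # the morphing network can normalize per-speaker characteristics
        # without averaging out regions of voice feature space.
        emb = self.lookup(speaker_id)
        return np.concatenate([features, np.tile(emb, (len(features), 1))], axis=1)

emb = SpeakerEmbeddings(n_speakers=10, dim=4)
conditioned = emb.condition(np.zeros((5, 3)), speaker_id=7)
```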
At block 1205, the method 1200 comprises evaluating an objective function for a plurality of data samples. Each data sample may be used to generate an input-output pair, e.g. based on input audio data training samples, where the output audio data is generated using the voice morphing apparatus. The objective function is defined as a function of at least an output of the voice morphing apparatus, where this output is generated based on a corresponding input, e.g. as received as a training sample. The objective function may comprise a loss function applied to each training sample, where the loss function is to be minimized. In other embodiments, the objective function may comprise a function to be optimized, e.g. by locating an extremum such as a minimum or maximum.
The objective function comprises a first term based on speaker identification and a second term based on audio fidelity. For example, the first term may be based on a measure of speaker identification determined using at least the output of the voice morphing apparatus. For example, this measure of speaker identification may comprise the output of one of the speaker identification systems 210, 310 or 710. It may be computed using an output of a speaker identification component and may comprise a certainty or confidence score. The first term modifies the objective function in proportion to the measure of speaker identification, e.g. may increase a value of a loss function to be minimized as a certainty or confidence of identification increases or may decrease a value of an objective function to be maximized. If the measure of speaker identification comprises an identification distance, e.g. a measure of a difference between a speaker probability vector determined based on the input audio data and a speaker probability vector determined based on the output audio data, then the first term may decrease a value of a loss function in proportion to this distance (such that the loss function is minimized as the distance is maximized).
The second term modifies the objective function proportional to a measure of audio fidelity between the output and the input. In certain cases, this may be based on both the input and the output; in other cases, it may be based on the output alone. The measure of audio fidelity may be a measure output by one or more of the components 220, 410, 510, 720 and 810 to 830. If the measure of audio fidelity comprises a distance measure, then an objective function to be minimized may be modified proportional to this measure (such that the objective function is minimized as the distance is minimized); if the measure of audio fidelity comprises a linguistic feature recognition score or probability, then an objective function to be minimized may be modified proportional to an inverse or negatively weighted version of this measure (such that the loss function is minimized as the linguistic feature recognition score is maximized). The term “proportional” is used in the embodiments herein in a broad sense to mean “based on”, “in accordance with” or “as a function of”. In the objective function itself, terms may be based on positive and/or negative weights, and/or may be modified using inverse computations depending on the measures that are used. The term “measure” is also used broadly herein to cover one or more of continuous values, discrete values, scalars, vectors (and other multidimensional measures), categorical values, and binary values (amongst others).
At block 1210, the evaluating at block 1205 is used to adjust parameters of the voice morphing apparatus. For example, if the voice morphing apparatus comprises an artificial neural network architecture, then adjusting parameters of the voice morphing apparatus comprises applying a gradient descent method to a derivative of the objective function with respect to the parameters of the artificial neural network architecture. The dashed line in
In certain embodiments, obtaining an audio fidelity score at block 1320, or evaluating the objective function at block 1205, may comprise computing a first phoneme recognition score for the input to the voice morphing apparatus using an audio processing component and computing a second phoneme recognition score for the output from the voice morphing apparatus using the audio processing component. The second term of the objective function, or the audio fidelity score, may be evaluated based on a comparison between the first and second phoneme recognition scores, e.g. representing a phoneme recognition distance. For example, this is also demonstrated in the embodiment of
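The phoneme-score comparison described above may be sketched as follows: the same acoustic model is run on the input and the output, and the per-frame phoneme probability vectors are compared. The softmax inputs and the Euclidean distance are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def phoneme_distance(logits_in, logits_out):
    """Compare per-frame phoneme probability vectors produced by the
    SAME audio processing component for the input and the output audio.
    Minimizing this distance helps preserve linguistic content."""
    p_in = softmax(np.asarray(logits_in, dtype=float))
    p_out = softmax(np.asarray(logits_out, dtype=float))
    # Mean Euclidean distance across frames.
    return np.linalg.norm(p_in - p_out, axis=-1).mean()
```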
In certain embodiments, obtaining an audio fidelity score at block 1320, or evaluating the objective function at block 1205, may alternatively or additionally comprise comparing a spectrogram for the input to the voice morphing apparatus and a spectrogram for the output of the voice morphing apparatus. In this case, the second term of the objective function, or the audio fidelity score, may be evaluated based on the comparison. For example, this is also demonstrated in the embodiment of
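The spectrogram comparison described above may be sketched as a mean absolute difference between framed FFT magnitudes; the frame size, hop size and the L1 distance are illustrative choices.

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128):
    """Magnitude spectrogram from overlapping windowed FFT frames."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=-1))

def spectrogram_distance(x_in, x_out):
    """Audio-fidelity sketch: mean absolute difference between the
    spectrogram of the input and the spectrogram of the output."""
    return np.mean(np.abs(spectrogram(x_in) - spectrogram(x_out)))
```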
In certain embodiments, obtaining a speaker identification score at block 1315, or evaluating the objective function at block 1205, may comprise computing a first speaker identification vector for the input to the voice morphing apparatus using a speaker identification component and computing a second speaker identification vector for the output from the voice morphing apparatus using the speaker identification component. The first term of the objective function, or the speaker identification score, may be evaluated based on a distance between the first and second speaker identification vectors, e.g. representing a speaker identification distance. For example, this is also demonstrated in the embodiment of
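The distance between the two speaker identification vectors may be sketched, for example, as a cosine distance; the choice of cosine distance (rather than, say, Euclidean distance) is an illustrative assumption.

```python
import numpy as np

def speaker_vector_distance(v_in, v_out):
    """Cosine distance between speaker identification vectors (e.g.
    x-vector-like embeddings) computed from the input and the output
    audio. Training seeks to MAXIMIZE this distance, so a loss function
    would include it with a negative weight or as an inverse."""
    v_in = np.asarray(v_in, dtype=float)
    v_out = np.asarray(v_out, dtype=float)
    cos = v_in @ v_out / (np.linalg.norm(v_in) * np.linalg.norm(v_out))
    return 1.0 - cos   # 0 for identical direction, 2 for opposite
```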
In certain embodiments, the objective function evaluated at block 1205 of the method 1200 comprises one or more further terms based on one or more of a gender classification using at least the output of the voice morphing apparatus and an accent classification using at least the output of the voice morphing apparatus, wherein the one or more further terms are weighted to either maintain or move away from one or more of a gender classification and an accent classification. For example, this may comprise modifying the method 1300 of
In these methods, an objective function, such as a loss function, may combine a speaker identification certainty measure with an inverse of an audio fidelity distance. The combination of two or more terms may be a weighted sum of each term. In certain cases, the weights may also be learned during training as a trainable parameter of the voice morphing apparatus. In certain cases, the weights may be dynamic, and may change based on values of one or more of the terms. For example, in one case the weights within the loss function may be applied as a form of attention layer during training. The speaker identification score or measure may be a vector. In certain cases, each element of this vector may relate to a different speaker identification feature and/or a different speaker to be identified. The audio fidelity score or measure may also comprise a vector. In certain cases, each element of this vector may relate to a frequency band, Mel feature and/or other audio feature. In these cases, the measures of speaker identification and/or audio fidelity may be distance measures within the multi-dimensional space of the vectors.
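The weighted combination described above may be sketched as follows, using the second-term convention set out earlier (the loss decreases as the fidelity distance decreases). The fixed, normalized weights are an illustrative assumption; as noted, they could instead be trainable or dynamic.

```python
import numpy as np

def morphing_loss(sid_certainty, fidelity_distance, w=(0.5, 0.5)):
    """Combined loss sketch: a weighted sum of a speaker identification
    certainty (to be driven down) and an audio fidelity distance between
    input and output (also to be driven down). Weights are normalized
    to sum to one."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return w[0] * sid_certainty + w[1] * fidelity_distance
```

Increasing one weight relative to the other gives more importance to the corresponding score, as described above.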
It should be noted that in embodiments described herein, the speaker identification measure or data and the audio fidelity measure or data may comprise one or more of continuous and discrete representations. For example, using a logit or probability output from a speaker identification system or an audio fidelity component may provide for a relatively continuous representation (within the limits of the precision of the number representation), which may result in a smooth and continuous loss function that may facilitate training. In other cases, however, the voice morphing apparatus may be trained as part of a generative adversarial network (GAN) and/or using a game-theory based algorithm. In these latter cases, discrete representations such as categorical data may be used as the measure or data. For example, the measure may be a speaker ID and/or a binary measure indicating successful identification or unsuccessful identification. Using differentiable approaches, as described herein, may help to filter out inconsistencies (e.g. like a cough in the input audio data) and may help avoid disrupting “jumps” (i.e. discontinuities) in the gradient.
Certain embodiments described herein may enable a neural network based voice morphing apparatus to be trained for a combination of at least three objectives: changing the sound of the voice of any speech; preserving the output audio as closely as possible to the input audio; and preserving the intelligibility of speech. In certain embodiments, the voice morphing apparatus may be trained adversarially with respect to at least a speaker identification system. This may be achieved by using a training loss function for the voice morphing apparatus that penalizes a high certainty or confidence from the speaker identification system.
In certain embodiments, to reduce a risk that the voice morphing apparatus simply learns to output random noise, an objective function may be defined that includes a first term that is dependent on the speaker identification certainty and a second term that is dependent on an audio fidelity. If the objective function comprises a loss function to be minimized, then the loss function may comprise a loss term or element that is positively weighted based on the speaker identification certainty and a loss term or element that is negatively (or inversely) weighted based on a distance score between the input and output audio data. A speaker identification term alone would tend to learn a mapping to random noise, whereas an audio fidelity term alone would tend to learn to copy the input to the output (e.g. as a simple pass-through filter). However, a combined loss function, where each loss term is appropriately configured to steer the loss of the training, yields a voice morphing apparatus that anonymizes a user yet maintains features of speech that may be understood by a human or a machine and preserves non-speech audio features such as transient or constant noise.
The systems and methods of training described herein also enable certain non-identifying features of speech audio, such as noise, gender, and accent, to be preserved. For example, this may be achieved by adding additional loss function terms based on classifier outputs, e.g. as described with reference to
At block 1432, the processor is instructed to load input audio data from a data source. The data source may be internal or external. The input audio data may comprise the input audio data 120 of
Certain embodiments described herein may be applied to speech processing including automatic speech recognition. The voice morphing apparatus, once trained, may be used as part of a speech processing pipeline, e.g. a selectively applicable anonymizer that may offer users a “private” speech mode. The voice morphing apparatus may be used to enhance privacy and anonymize the labelling of training data by removing recognizable components.
Certain methods and sets of operations as described herein may be performed by instructions that are stored upon a non-transitory computer readable medium. The non-transitory computer readable medium stores code comprising instructions that, if executed by one or more computers, would cause the one or more computers to perform steps of methods described herein. The non-transitory computer readable medium may comprise one or more of a rotating magnetic disk, a rotating optical disk, a flash random access memory (RAM) chip, and other mechanically moving or solid-state storage media.
Certain embodiments have been described herein, and it will be noted that different combinations of different components from different embodiments may be possible. Salient features are presented to better explain embodiments; however, it is clear that certain features may be added, modified and/or omitted without modifying the functional aspects of these embodiments as described.
Various embodiments are methods that use the behavior of either or a combination of humans and machines. Method embodiments are complete wherever in the world most constituent steps occur. Some embodiments are one or more non-transitory computer readable media arranged to store such instructions for methods described herein. Whatever machine holds non-transitory computer readable media comprising any of the necessary code may implement an embodiment. Some embodiments may be implemented as: physical devices such as semiconductor chips; hardware description language representations of the logical or functional behavior of such devices; and one or more non-transitory computer readable media arranged to store such hardware description language representations. Descriptions herein reciting principles, aspects, and embodiments encompass both structural and functional equivalents thereof.
Practitioners skilled in the art will recognize many possible modifications and variations. The modifications and variations include any relevant combination of the disclosed features. Descriptions herein reciting principles, aspects, and embodiments encompass both structural and functional equivalents thereof. Elements described herein as “coupled” or “communicatively coupled” have an effectual relationship realizable by a direct connection or indirect connection, which uses one or more other intervening elements. Embodiments described herein as “communicating” or “in communication with” another device, module, or elements include any form of communication or link. For example, a communication link may be established using a wired connection, wireless protocols, near-field protocols, or RFID.
The scope of the invention, therefore, is not intended to be limited to the embodiments shown and described herein. Rather, the scope and spirit of the present invention is embodied by the appended claims.
Number | Name | Date | Kind |
---|---|---|---|
5567901 | Gibson | Oct 1996 | A |
5893057 | Fujimoto et al. | Apr 1999 | A |
5946658 | Miyazawa et al. | Aug 1999 | A |
8170878 | Liu | May 2012 | B2 |
10249314 | Aryal | Apr 2019 | B1 |
10839809 | Jha et al. | Nov 2020 | B1 |
11100940 | Pearson | Aug 2021 | B2 |
20080195387 | Zigel et al. | Aug 2008 | A1 |
20090030865 | Sawada | Jan 2009 | A1 |
20090281807 | Hirose | Nov 2009 | A1 |
20140195222 | Peevers | Jul 2014 | A1 |
20150336578 | Lord et al. | Nov 2015 | A1 |
20180342256 | Huffman | Nov 2018 | A1 |
20190051314 | Nakashika | Feb 2019 | A1 |
20190066658 | Fujioka | Feb 2019 | A1 |
20190304480 | Narayanan | Oct 2019 | A1 |
20200388295 | Angland | Dec 2020 | A1 |
20200395028 | Kameoka | Dec 2020 | A1 |
20210005180 | Kim | Jan 2021 | A1 |
20210193159 | Pearson | Jun 2021 | A1 |
20210200965 | Yerli | Jul 2021 | A1 |
20210217431 | Pearson | Jul 2021 | A1 |
20210225383 | Takahashi | Jul 2021 | A1 |
Number | Date | Country |
---|---|---|
2215632 | Mar 2011 | EP |
WO 2019116889 | Jun 2019 | JP |
Entry |
---|
Fang, Fuming, et al. “High-quality nonparallel voice conversion based on cycle-consistent adversarial network.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018. (Year: 2018). |
Hui Ye, Quality-enhanced voice morphing using maximum likelihood transformations. IEEE Transactions on Audio, Speech, and Language Processing. Jun. 19, 2006;14(4):1301-12. |
Jaime Lorenzo-Trueba, Towards achieving robust universal neural vocoding. In Proc. Interspeech 2019 (vol. 2019, pp. 181-185). |
Denis Stadniczuk, An open-source Octave toolbox for VTLN-based voice conversion. In Proc. International Conference of the German Society for Computational Linguistics and Language Technology, Darmstadt, Germany Sep. 2013. |
Fuming Fang, Speaker Anonymization Using X-vector and Neural Waveform Models. arXiv preprint arXiv:1905.13561. May 30, 2019. |
Sajedur Rahman, Pitch shifting of voices in real-time. Computer Engineering. 2:35163. |
Ido Cohn, Audio De-identification: A New Entity Recognition Task. arXiv preprint arXiv:1903.07037. Mar. 17, 2019. |
Ryan Prenger, Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) May 13, 2019 (pp. 3617-3621). IEEE. |
Mohamed Abou-Zleikha, A discriminative approach for speaker selection in speaker de-identification systems. In 2015 23rd European Signal Processing Conference (EUSIPCO) Sep. 4, 2015 (pp. 2102-2106). IEEE. |
Qin Jin, Speaker de-identification via voice transformation. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding 2009 (pp. 529-533). IEEE. |
Fahimeh Bahmaninezhad, Convolutional Neural Network Based Speaker De-Identification. In Odyssey 2018 (pp. 255-260). |
Ching-Hsiang Ho, Formant model estimation and transformation for voice morphing. In Seventh International Conference on Spoken Language Processing 2002. |
Lifa Sun, Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. In 2016 IEEE International Conference on Multimedia and Expo (ICME) Jul. 11, 2016 (pp. 1-6). IEEE. |
David Snyder, X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Apr. 15, 2018 (pp. 5329-5333). IEEE. |
Nal Kalchbrenner, Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435. Feb. 23, 2018. |
Alice Cohen-Hadria, Voice Anonymization in Urban Sound Recordings, 2019, UMR STMS 9912M / Sorbonne Universite, IRCAM, CNRS, France. |
Battenberg et al. (2020). “Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis”. ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). doi:10.1109/icassp40776.2020.9. |
Desplanques et al. “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification”. Proc. Interspeech 2020, 3830-3834, doi: 10.21437/Interspeech.2020-2650. |
Ju-chieh Chou et al. “One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization”. College of Electrical Engineering and Computer Science, National Taiwan University. Aug. 2019. arXiv:1904.05742. |
Kim et al. (2017). “Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning”. arXiv:1609.06773v2 [cs.CL]. |
Liu et al. “Any to Many Voice Conversion With Location Relative Sequence to Sequence Modeling”. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, 2021. |
Qian et al. “Global Rhythm Style Transfer Without Text Transcriptions”. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. |
Shen et al. (2018). “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”. arXiv:1712.05884. |
Sisman et al. “An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning”. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, 2021. |
Wan et al. (2020). “Generalized End-to-End Loss for Speaker Verification”. arXiv:1710.10467v5 [eess.AS] Nov. 9, 2020. |
Number | Date | Country | |
---|---|---|---|
20210217431 A1 | Jul 2021 | US |