This disclosure relates to machine learning systems and, more specifically, to machine learning systems to detect synthetic speech.
Deep fakes are increasingly a concern of national interest and have fueled the rapid spread of fake news. Deep fakes often include speech audio that has been manipulated with synthetic speech that emulates or clones the speaker of the original audio. Synthetic speech may be generated by many different speech generation techniques, such as applying various text-to-speech models. Synthetic speech may also be included or injected in otherwise original speech audio, meaning that only portions of the audio include synthetic speech.
In general, the disclosure describes techniques for detecting synthetic speech in an audio clip. A system obtains an audio clip that includes speech from a speaker, which may be original speech from the speaker and/or partially or wholly synthetic speech purporting to be from the speaker. The system may obtain the audio clip as audio data input by a user who wants to determine whether the audio data includes at least some synthetic speech injected to manipulate the speech of a speaker. The system processes the obtained audio clip using a machine learning system trained to identify specific portions (e.g., frames) of audio clips that include synthetic speech. The machine learning system may generate speech artifact embeddings for the obtained audio clip based on synthetic speech artifact features extracted by the machine learning system. For example, the machine learning system may generate speech artifact embeddings based on synthetic speech artifact features that indicate artifacts left behind in synthetic speech by various speech generators.
The machine learning system may compute scores for an obtained audio clip based on the generated speech artifact embeddings. The machine learning system may, for example, apply probabilistic linear discriminant analysis (PLDA) to compute scores for the obtained audio clip based on enrollment vectors associated with authentic speech and the speech artifact embeddings. The machine learning system may compute segment scores for frames of the obtained audio clip to determine whether one or more frames of the obtained audio clip include synthetic speech. In some instances, the machine learning system may additionally or alternatively compute an utterance level score representing a likelihood that the whole waveform of the obtained audio clip includes synthetic speech.
The techniques may provide one or more technical advantages that realize at least one practical application. For example, the system may apply the machine learning system to detect synthetic speech in audio clips in which synthetic speech may be interleaved with authentic speech. Conventionally, synthetic speech detection techniques focus on detecting fully synthetic audio recordings. The machine learning system, according to the techniques described herein, may be trained to identify synthetic speech based on synthetic speech artifact features left behind by various speech generation tools, which avoids overfitting detection to synthetic speech generated by any one speech generation tool. The machine learning system, according to the techniques described herein, may operate as a robust synthetic audio detector that can detect synthetic audio in both partially synthetic and fully synthetic audio waveforms. In this way, the system may indicate to a user whether an input audio clip has been modified, and which specific frames of the audio clip, if any, have been modified to include synthetic speech audio.
In one example, a method includes processing, by a machine learning system, an audio clip to generate a plurality of speech artifact embeddings based on a plurality of synthetic speech artifact features. The method may further include computing, by the machine learning system, one or more scores based on the plurality of speech artifact embeddings. The method may further include determining, by the machine learning system, based on the one or more scores, whether one or more frames of the audio clip include synthetic speech. The method may further include outputting an indication of whether the one or more frames of the audio clip include synthetic speech.
In another example, a computing system may include processing circuitry and memory for executing a machine learning system. The machine learning system may be configured to process an audio clip to generate a plurality of speech artifact embeddings based on a plurality of synthetic speech artifact features. The machine learning system may further be configured to compute one or more scores based on the plurality of speech artifact embeddings. The machine learning system may further be configured to determine, based on the one or more scores, whether one or more frames of the audio clip include synthetic speech.
In another example, computer-readable storage media may include machine readable instructions for configuring processing circuitry to process, by a machine learning system, an audio clip to generate a plurality of speech artifact embeddings based on a plurality of synthetic speech artifact features. The processing circuitry may further be configured to compute one or more scores based on the plurality of speech artifact embeddings. The processing circuitry may further be configured to determine, by the machine learning system, based on the one or more scores, whether one or more frames of the audio clip include synthetic speech.
The details of one or more examples of the techniques of this disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the techniques will be apparent from the description and drawings, and from the claims.
Like reference characters refer to like elements throughout the figures and description.
Computing system 100 may represent one or more computing devices configured to execute machine learning system 110. Machine learning system 110 may be trained to detect synthetic speech in audio (e.g., audio 152). In the example of
In accordance with techniques described herein, computing system 100 may output a determination of whether audio 152 includes at least one frame of synthetic speech. Audio 152, for example, may include an audio file (e.g., waveform audio file format (WAV), MPEG-4 Part 14 (MP4), etc.) with audio of speech that may be partially or wholly synthetic. Audio 152 may be audio that is associated with video or other multimedia. As used herein, an audio clip refers to any audio stored to media. Computing system 100 may obtain audio 152 from computing device 150 via a network, for example.
Computing system 100, applying speech artifact embeddings module 112 of machine learning system 110, may generate a plurality of speech artifact embeddings for corresponding frames of audio 152. Speech artifact embeddings module 112 may generate the speech artifact embeddings as vector representations of synthetic speech artifact features of frames of audio 152 in a high-dimensional space. Speech artifact embeddings module 112 may include one or more machine learning models (e.g., Residual Neural Networks (ResNets), other neural networks such as recurrent neural networks (RNNs) or deep neural networks (DNNs), etc.) trained with training data 122 to generate speech artifact embeddings for frames of audio clips based on synthetic speech artifact features. Synthetic speech artifact features may include acoustic features of artifacts in synthetic speech of a frame in an audio clip that have been left behind by various speech generators. Speech artifact embeddings module 112 may apply acoustic feature extraction techniques to identify and extract synthetic speech artifact features of audio 152. For example, speech artifact embeddings module 112 may be trained to apply acoustic feature extraction techniques (e.g., Linear Filter Bank, Mel-Frequency Cepstral Coefficients, Power-Normalized Cepstral Coefficients, Constant Q Cepstral Coefficients, etc.) to extract synthetic speech artifact features from many different speech generators as vectors that may specify waveform artifacts in frequency regions outside the fixed spectral range of human speech. Speech artifact embeddings module 112 may extract synthetic speech artifact features from audio 152 for a predefined window of frames (e.g., 20 milliseconds). Speech artifact embeddings module 112 may include a timestamp in vectors of the synthetic speech artifact features specifying a time frame of audio 152 (e.g., 20 milliseconds to 40 milliseconds of audio included in audio 152) corresponding to extracted speech artifact features.
Speech artifact embeddings module 112 may be trained to extract synthetic speech artifact features based on training data 122. Training data 122 is stored to a storage device and includes training audio clips with one or more frames of audio that include synthetic speech generated by various speech generators. In some instances, prior to or at the same time as extracting synthetic speech artifact features from training audio clips of training data 122, speech artifact embeddings module 112 may apply a machine learning model (e.g., a deep neural network) to remove non-speech information (e.g., silences, background noise, etc.) from the training audio clips of training data 122. Speech artifact embeddings module 112 may determine non-speech information from training audio clips of training data 122 and remove vectors of synthetic speech artifact features corresponding to time frames of the determined non-speech information. Speech artifact embeddings module 112 may apply the machine learning model (e.g., a speech activity detector) to identify non-speech information in audio 152 and remove, based on timestamps included in vectors of the synthetic speech artifact features, synthetic speech artifact features associated with audio 152 that correspond to the identified non-speech instances. In this way, speech artifact embeddings module 112 may extract synthetic speech artifact features without non-speech information that may otherwise overwhelm the critical information upon which the synthetic speech artifact features are based.
Speech artifact embeddings module 112 may process synthetic speech artifact features associated with audio 152 using the one or more machine learning models to generate speech artifact embeddings for frames of audio 152 that capture distortions or frequency artifacts associated with audio waveforms in frames of audio 152 that may have been generated by a speech generator. Speech artifact embeddings module 112 may include a timestamp of a frame in a speech artifact embedding generated for the frame. Speech artifact embeddings module 112 may generate the speech artifact embeddings based on synthetic speech artifact features by training an embeddings or latent space of the one or more machine learning models of speech artifact embeddings module 112 with synthetic speech artifact features extracted from synthetic speech clips of training data 122. In some examples, speech artifact embeddings module 112 may train the one or more machine learning models to generate speech artifact embeddings based on audio clips by mapping speech artifact features (e.g., synthetic speech artifact features and/or authentic speech artifact features) to an embedding space of the one or more machine learning models. Speech artifact embeddings module 112 may determine boundaries in the mapping of the speech artifact features based on labels of speech clips included in training data 122 identifying whether audio waveform frames corresponding to the speech artifact features include synthetic speech. Speech artifact embeddings module 112 may apply the boundaries during training of the one or more machine learning models to improve generalization of synthetic speech artifact features represented in speech artifact embeddings across unknown conditions.
Computing system 100 may generate speech artifact embeddings as vector representations of distortions included in audio 152 that may have been created by one or more speech generators of various types. Speech artifact embeddings module 112 may train the one or more machine learning models to generate speech artifact embeddings based in part on training data 122. In some instances, speech artifact embeddings module 112 may augment training audio clips included in training data 122 to improve generalizations made about audio clips, avoid the one or more machine learning models overfitting to any one speech generator, and/or defeat anti-forensic techniques that may be implemented by synthetic speech generators. For example, by speech artifact embeddings module 112 augmenting training audio clips of training data 122, machine learning system 110 may be trained to be more robust so as to overcome deliberate augmentations to synthetic speech that may be implemented by synthetic speech generators. Speech artifact embeddings module 112 may augment training audio clips of training data 122 using one or more data augmentation strategies. For example, speech artifact embeddings module 112 may augment training audio clips of training data 122 by injecting different types of audio degradation (e.g., reverb, compression, instrumental music, noise, etc.) into the training audio clips. In some examples, speech artifact embeddings module 112 may augment training audio clips of training data 122 by applying frequency masking techniques. Speech artifact embeddings module 112 may apply frequency masking techniques to training audio clips of training data 122 to randomly drop out frequency bands during training of the one or more machine learning models of speech artifact embeddings module 112.
Scoring module 132 of machine learning system 110 may generate one or more scores based on the speech artifact embeddings generated by speech artifact embeddings module 112. Scoring module 132 may apply probabilistic linear discriminant analysis (PLDA) to the speech artifact embeddings to generate, for each speech artifact embedding, a probability (e.g., a log-likelihood ratio) that corresponds to a likelihood that the frame associated with the speech artifact embedding includes synthetic speech. For example, scoring module 132 may determine a probability that a frame corresponding to a speech artifact embedding includes synthetic speech by comparing the speech artifact embedding to an enrollment embedding associated with authentic speech. Scoring module 132 may determine the probabilities based on enrollment embeddings that may include a vector representation of authentic speech artifact features from authentic speech in audio clips (e.g., training speech clips of training data 122). Enrollment embeddings may include a vector representation of authentic speech artifact features such as pitch, intonation, rhythm, articulation, accent, pronunciation pattern, or other human vocal characteristics. In some instances, scoring module 132 may apply a machine learning model (e.g., residual networks, neural networks, etc.) to generate enrollment embeddings based on training speech clips of training data 122 that include authentic speech.
Scoring module 132 may convert each of the probabilities for each speech artifact embedding to a segment score for the corresponding frame that represents whether the corresponding frame includes synthetic speech. Scoring module 132 may label each segment score with the timestamp of the frame that the corresponding speech artifact embedding represents. Scoring module 132 may determine whether one or more frames of an audio clip include synthetic speech based on the segment scores. For example, scoring module 132 may determine a frame of audio 152 includes synthetic speech based on a segment score associated with the frame satisfying a threshold (e.g., a segment score greater than zero). Additionally, or alternatively, scoring module 132 may determine a frame of audio 152 does not include synthetic speech, and is authentic, based on a segment score associated with the frame satisfying a threshold (e.g., a segment score less than 0.2). Scoring module 132 may determine specific time frames of audio 152 where synthetic speech was detected based on the timestamps corresponding to the segment scores. Computing system 100 may output an indication of the determination of which frames, if any, of audio 152 include synthetic speech to computing device 150. The indication may include specific references to the time frames in which synthetic speech was detected. Computing device 150 may output the indication via GUI 154.
In some instances, scoring module 132 may generate an utterance level score for the whole waveform of an audio clip (e.g., audio 152) based on the segment scores for each frame of the audio clip. For example, scoring module 132 may generate an utterance level score for an audio clip by averaging all segment scores. Computing system 100 may output the segment scores and utterance level scores to computing device 150. Computing device 150 may output the segment scores and utterance level score via GUI 154 to allow a user to identify whether one or more frames of audio 152 include synthetic speech.
The techniques may provide one or more technical advantages that realize at least one practical application. For example, machine learning system 110 may determine whether only a portion or the entirety of an audio clip includes synthetic speech. Speech artifact embeddings module 112 of machine learning system 110, in generating speech artifact embeddings for frames or sets of frames of audio 152, allows machine learning system 110 to determine specific temporal locations of synthetic speech that may have been injected in audio 152. Machine learning system 110 may train one or more machine learning models of speech artifact embeddings module 112 to generate speech artifact embeddings based on synthetic speech artifact features to ultimately detect synthetic speech generated by many different speech generators. By augmenting and refining the training data used in the training of the one or more machine learning models, machine learning system 110 may avoid overfitting the one or more machine learning models to specific speech generators. Machine learning system 110 may train the one or more machine learning models for robust synthetic speech detection of synthetic speech generated by any number or variety of speech generators.
Computing system 200 comprises any suitable computing system having one or more computing devices, such as servers, desktop computers, laptop computers, gaming consoles, smart televisions, handheld devices, tablets, mobile telephones, smartphones, etc. In some examples, at least a portion of computing system 200 is distributed across a cloud computing system, a data center, or across a network, such as the Internet, another public or private communications network, for instance, broadband, cellular, Wi-Fi, ZigBee, Bluetooth® (or other personal area network-PAN), Near-Field Communication (NFC), ultrawideband, satellite, enterprise, service provider and/or other types of communication networks, for transmitting data between computing systems, servers, and computing devices.
Computing system 200, in the example of
One or more output devices 208 may generate, transmit, or process output. Examples of output are tactile, audio, visual, and/or video output. Output devices 208 may include a display, sound card, video graphics adapter card, speaker, presence-sensitive screen, one or more USB interfaces, video and/or audio output interfaces, or any other type of device capable of generating tactile, audio, video, or other output. Output devices 208 may include a display device, which may function as an output device using technologies including liquid crystal displays (LCD), quantum dot display, dot matrix displays, light emitting diode (LED) displays, organic light-emitting diode (OLED) displays, cathode ray tube (CRT) displays, e-ink, or monochrome, color, or any other type of display capable of generating tactile, audio, and/or visual output. In some examples, computing system 200 may include a presence-sensitive display that may serve as a user interface device that operates both as one or more input devices 206 and one or more output devices 208.
One or more communication units 207 of computing system 200 may communicate with devices external to computing system 200 (or among separate computing devices of computing system 200) by transmitting and/or receiving data, and may operate, in some respects, as both an input device and an output device. In some examples, communication units 207 may communicate with other devices over a network. In other examples, communication units 207 may send and/or receive radio signals on a radio network such as a cellular radio network. Examples of communication units 207 may include a network interface card (e.g., such as an Ethernet card), an optical transceiver, a radio frequency transceiver, a GPS receiver, or any other type of device that can send and/or receive information. Other examples of communication units 207 may include Bluetooth®, GPS, 3G, 4G, and Wi-Fi® radios found in mobile devices as well as Universal Serial Bus (USB) controllers and the like.
Processing circuitry 202 and memory 204 may be configured to execute machine learning system 210 to determine whether an input audio clip includes synthetic speech, according to techniques of this disclosure. Memory 204 may store information for processing during operation of speech artifact embeddings module 212 and scoring module 232. In some examples, memory 204 may include temporary memories, meaning that a primary purpose of the one or more storage devices is not long-term storage. Memory 204 may be configured for short-term storage of information as volatile memory and therefore not retain stored contents if deactivated. Examples of volatile memories include random access memories (RAM), dynamic random access memories (DRAM), static random access memories (SRAM), and other forms of volatile memories known in the art. Memory 204, in some examples, also includes one or more computer-readable storage media. Memory 204 may be configured to store larger amounts of information than volatile memory. Memory 204 may further be configured for long-term storage of information as non-volatile memory space and retain information after activate/off cycles. Examples of non-volatile memories include magnetic hard disks, optical discs, floppy disks, Flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable (EEPROM) memories. Memory 204 may store program instructions and/or data associated with one or more of the modules (e.g., speech artifact embeddings module 212 and scoring module 232 of machine learning system 210) described in accordance with one or more aspects of this disclosure.
Processing circuitry 202 and memory 204 may provide an operating environment or platform for speech artifact embeddings module 212 and scoring module 232, which may be implemented as software, but may in some examples include any combination of hardware, firmware, and software. Processing circuitry 202 may execute instructions and memory 204 may store instructions and/or data of one or more modules. The combination of processing circuitry 202 and memory 204 may retrieve, store, and/or execute the instructions and/or data of one or more applications, modules, or software. Processing circuitry 202 and memory 204 may also be operably coupled to one or more other software and/or hardware components, including, but not limited to, one or more of the components illustrated in
In the example of
In accordance with techniques described herein, machine learning system 210 may detect whether one or more frames of an audio clip include synthetic speech. Machine learning system 210 may obtain input data 244 that includes an audio clip (e.g., audio 152 of
Machine learning system 210 may process the audio clip included in input data 244 using speech artifact embeddings module 212. Feature extraction module 216 of speech artifact embeddings module 212 may extract synthetic speech artifact features from the audio clip. For example, feature extraction module 216 may apply a filter bank (e.g., Linear Filter Bank, Mel-Frequency Cepstral Coefficients, Power-Normalized Cepstral Coefficients, Constant Q Cepstral Coefficients, etc.) to extract distortions and/or degradations of speech audio included in the audio clip according to a predefined frame rate (e.g., extracting distortions and/or degradations for time windows or frames capturing 20 to 30 millisecond portions of an audio waveform included in an audio clip). Feature extraction module 216 may encode synthetic speech artifact features of speech audio for each frame of the audio clip (e.g., a frame or time window of audio corresponding to a segment of an audio waveform included in the 20 millisecond to 30 millisecond portion of the audio clip) as vector representations of distortions or degradations identified in corresponding frames. Feature extraction module 216 may include a timestamp in each synthetic speech artifact feature vector specifying a corresponding frame (e.g., an indication included in metadata of a synthetic speech artifact feature vector that the represented synthetic speech artifact features were extracted from the 20 millisecond to 30 millisecond frame of the audio clip).
Training module 214, in the example of
Training module 214 may train feature extraction module 216 to extract synthetic speech artifact features. Training module 214 may train feature extraction module 216 based on training speech clips in training data 222. For example, training module 214 may train feature extraction module 216 with training speech clips stored at training data 222 that include training audio clips with partially or wholly synthetic speech generated by various synthetic speech generators. Training module 214 may train a filter bank (e.g., Linear Filter Bank, Mel-Frequency Cepstral Coefficients, Power-Normalized Cepstral Coefficients, Constant Q Cepstral Coefficients, etc.) of feature extraction module 216 to extract synthetic speech artifacts (e.g., frequency regions outside the fixed spectral range of human speech) with high spectral resolution over a predefined time window. For example, training module 214 may train feature extraction module 216 to apply 70 triangular linearly spaced filters to extract synthetic speech artifact features from audio clips with a 25 millisecond window and a 10 millisecond frameshift.
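A minimal sketch of this kind of front end is shown below, assuming a 70-filter, linearly spaced triangular filter bank applied over 25 millisecond windows with a 10 millisecond frameshift as described above. The FFT size, windowing, and function names (e.g., extract_features) are illustrative assumptions rather than the exact implementation of feature extraction module 216.

```python
# Sketch: linear triangular filter-bank features with per-frame timestamps.
import numpy as np

def linear_triangular_filterbank(num_filters=70, n_fft=512):
    """Build triangular filters linearly spaced from 0 Hz to Nyquist."""
    bins = np.linspace(0, n_fft // 2, num_filters + 2)
    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(num_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(int(left), int(right) + 1):
            if k <= center:
                fbank[i, k] = (k - left) / max(center - left, 1e-8)
            else:
                fbank[i, k] = (right - k) / max(right - center, 1e-8)
    return fbank

def extract_features(waveform, sample_rate=16000, win_ms=25, hop_ms=10, n_fft=512):
    """Return per-frame log filter-bank vectors and frame timestamps (seconds)."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    fbank = linear_triangular_filterbank(n_fft=n_fft)
    feats, stamps = [], []
    for start in range(0, len(waveform) - win + 1, hop):
        frame = waveform[start:start + win] * np.hamming(win)
        spectrum = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2
        feats.append(np.log(fbank @ spectrum + 1e-10))
        stamps.append(start / sample_rate)  # timestamp marking the frame's location
    return np.stack(feats), np.array(stamps)
```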
Feature extraction module 216 may refine the synthetic speech artifact features by removing synthetic speech artifact features that may correspond to frames where there is no speech. Feature extraction module 216 may include a speech activity detector with a machine learning model (e.g., a deep neural network) trained to identify pauses, silences, background noise, or other non-speech information included in an audio clip. Feature extraction module 216 may apply the speech activity detector to identify non-speech information in the audio clip included in input data 244. Feature extraction module 216 may apply the speech activity detector to identify non-speech information over the same time window as the filter bank of feature extraction module 216 extracted synthetic speech artifact features. The speech activity detector of feature extraction module 216 may output a Boolean value for each frame of the audio clip included in input data 244 (e.g., audio corresponding to waveforms included in the 20 millisecond to 30 millisecond frame of the audio clip) specifying whether non-speech information was detected (e.g., output a value of 1 if a pause or silence is detected or output a value of 0 if speech is detected). The speech activity detector may include a timestamp in each output specifying a corresponding frame (e.g., an indication output with the Boolean value that speech or silence was detected from the 20 millisecond to 30 millisecond frame of the audio clip). Feature extraction module 216 may remove or prune synthetic speech artifact features generated by the filter bank based on outputs of the speech activity detector identifying frames of the audio clip with non-speech information. Feature extraction module 216 may provide machine learning model 218 the synthetic speech artifact features associated with the audio clip included in input data 244.
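The pruning step described above can be sketched as follows, assuming per-frame Boolean outputs from a speech activity detector aligned with the filter-bank frames (a value of 1 marking a pause or silence, 0 marking speech, as in the example above). The helper names, including sad_model, are hypothetical.

```python
# Sketch: drop feature vectors for frames flagged as non-speech.
import numpy as np

def prune_non_speech(features, timestamps, nonspeech_flags):
    """Keep only frames where speech was detected (flag value 0), matching the
    Boolean convention described above (1 = pause/silence, 0 = speech)."""
    mask = np.asarray(nonspeech_flags) == 0
    return features[mask], timestamps[mask]

# Example usage with the illustrative extractor sketched earlier:
# feats, stamps = extract_features(waveform)
# feats, stamps = prune_non_speech(feats, stamps, sad_model(feats))  # sad_model is hypothetical
```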
Training module 214 may train machine learning model 218 to generate speech artifact embeddings. Training module 214 may train machine learning model 218 based on training speech clips included in training data 222. In some instances, training module 214 may train machine learning model 218 with training speech clips of training data 222 augmented using various data augmentation strategies. For example, training module 214 may augment training speech clips of training data 222 with different types of audio degradation (e.g., reverb, compression, instrumental music, noise, etc.). Training module 214 may additionally, or alternatively, augment training speech clips of training data 222 by applying frequency masking to randomly drop out frequency bands from the training speech clips. Training module 214 may augment training speech clips of training data 222 to avoid poor deep-learning model performance and model overfitting.
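The following sketch illustrates two of the augmentation strategies described above: additive noise degradation on a training waveform and random frequency-band dropout on filter-bank features. The signal-to-noise ratio and mask width are assumed values, not parameters taken from this disclosure.

```python
# Sketch: waveform-level noise degradation and feature-level frequency masking.
import numpy as np

def add_noise(waveform, noise, snr_db=15.0):
    """Mix a noise recording into a training clip at an assumed target SNR."""
    noise = np.resize(noise, len(waveform))
    speech_power = np.mean(waveform ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return waveform + scale * noise

def frequency_mask(features, max_bands=8, rng=None):
    """Randomly zero out a contiguous band of filter-bank channels."""
    rng = rng or np.random.default_rng()
    num_bands = features.shape[1]
    width = rng.integers(1, max_bands + 1)
    start = rng.integers(0, num_bands - width + 1)
    masked = features.copy()
    masked[:, start:start + width] = 0.0
    return masked
```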
In some examples, training module 214 is implemented by a separate training computing system that trains machine learning model 218 as described above. In such examples, trained machine learning model 218 is exported to computing system 200 for use in detecting synthetic speech.
Machine learning model 218 may generate speech artifact embeddings based on synthetic speech artifact features. Machine learning model 218 may include a machine learning model, such as a deep neural network with an X-ResNet architecture, trained to generate speech artifact embeddings that capture relevant information of artifacts, distortions, degradations, or the like from synthetic speech artifact features extracted by feature extraction module 216. Machine learning model 218 may include a deep neural network with an X-ResNet architecture that utilizes more discriminant information from input features. For example, machine learning model 218 may be a deep neural network including a residual network architecture with a modified input stem including 3×3 convolutional layers with stride 2 in the first layer for downsampling, with 32 filters in the first two layers, and 64 filters in the last layer. Machine learning model 218 may provide the modified input stem the synthetic speech artifact features with a height dimension of 500 corresponding to a temporal dimension (e.g., frames of the audio clip included in input data 244), a width dimension of 70 corresponding to a filter bank index, and a depth dimension of 1 corresponding to image channels. Machine learning model 218 may include one or more residual blocks of the residual network that serially downsample inputs from the modified input stem or previous residual blocks and double the number of filters to keep the computation constant. Machine learning model 218 may include residual blocks that downsample inputs with 2×2 average pooling for anti-aliasing benefits and/or a 1×1 convolution to increase the number of feature maps to match a residual path's output. In some instances, machine learning model 218 may include Squeeze-and-Excitation (SE) blocks to adaptively re-calibrate convolution channel inter-dependencies into a global feature such that the dominant channels can achieve higher weights. Machine learning model 218 may implement SE blocks throughout the residual network (e.g., after processing by the modified input stem and before the first residual block). Machine learning model 218 may provide the final output of the residual blocks or stages of the residual network to a statistical pooling and embeddings layer of the residual network for further processing. Machine learning model 218 may extract embeddings from the last layer of the residual network as speech artifact embeddings.
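The following PyTorch sketch illustrates the kind of residual embedding extractor described above: a modified input stem of 3×3 convolutions with stride 2 in the first layer and 32/32/64 filters, an SE block, residual blocks that downsample with 2×2 average pooling and a 1×1 shortcut convolution, and a statistics pooling and embeddings layer. The block count, embedding dimension, and layer details are simplified assumptions, not the exact architecture of machine learning model 218.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: re-weight channels by globally pooled statistics."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
    def forward(self, x):
        w = x.mean(dim=(2, 3))                       # squeeze: global average pool
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                                 # excite: boost dominant channels

class DownsampleResBlock(nn.Module):
    """Residual block that downsamples with 2x2 average pooling and doubles channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.AvgPool2d(2),                         # average pooling for anti-aliasing
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch))
        self.shortcut = nn.Sequential(
            nn.AvgPool2d(2),
            nn.Conv2d(in_ch, out_ch, 1))             # 1x1 conv matches the residual path
    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))

class ArtifactEmbedder(nn.Module):
    """Input: (batch, 1, frames, 70 filter-bank channels) -> speech artifact embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.stem = nn.Sequential(                   # modified input stem: stride 2 first
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.se = SEBlock(64)                        # SE block before the first residual block
        self.blocks = nn.Sequential(
            DownsampleResBlock(64, 128),
            DownsampleResBlock(128, 256))
        self.embed = nn.Linear(256 * 2, embed_dim)   # statistics pooling -> embedding layer
    def forward(self, x):
        h = self.blocks(self.se(self.stem(x)))
        h = h.mean(dim=3)                            # pool over the filter-bank axis
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # mean + std over time
        return self.embed(stats)

# emb = ArtifactEmbedder()(torch.randn(2, 1, 500, 70))   # -> shape (2, 256)
```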
During training of machine learning model 218, training module 214 may train machine learning model 218 to output speech artifact embeddings. For example, training module 214 may apply a one-class feature learning approach to train a compact embeddings space of the residual network of machine learning model 218 by introducing margins to consolidate target authentic speech and isolate synthetic speech data. Training module 214 may train the embeddings space of the residual network of machine learning model 218 according to the following function:
where x_i ∈ ℝ^D represents the normalized target-class embedding, ŵ_0 ∈ ℝ^D represents the weight vector, y_i ∈ {0, 1} denotes clip labels (e.g., 0 for synthetic and 1 for authentic), and m_0, m_1 ∈ [−1, 1], where m_0 > m_1, are the angular margins between classes. Training module 214 may apply the function to train the embeddings space of the residual network of machine learning model 218 to establish one or more boundaries in the embeddings space corresponding to whether an extracted synthetic speech artifact feature should be included in a speech artifact embedding. For example, training module 214 may provide extracted synthetic speech artifact features from training speech clips of training data 222 to machine learning model 218. Machine learning model 218 may apply a machine learning model (e.g., a residual network) to process the input synthetic speech artifact features. Machine learning model 218 may map the processed synthetic speech artifact features to an embeddings space of the residual network. Training module 214 may apply the function to the processed synthetic speech artifact features to determine one or more boundaries in the mapping of the processed synthetic speech artifact features to the embeddings space that outline an area in the embeddings space mapping where processed synthetic speech artifact features correspond to either synthetic or authentic speech based on labels of the training speech clip associated with the processed synthetic speech artifact features. In this way, the residual network of machine learning model 218 may be trained with improved deep neural network generalization across unknown conditions. Machine learning model 218 may apply the boundaries during inference time to determine which of the processed synthetic speech artifact features should be represented in speech artifact embeddings. For example, during inference time, machine learning model 218 may map processed synthetic speech artifact features associated with a frame of an input audio clip to the embeddings space and generate a speech artifact embedding to include a vector representation of the processed synthetic speech artifact features that were mapped within the area of the embeddings space corresponding to synthetic speech.
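The function itself is not reproduced here; the sketch below shows a one-class, angular-margin style objective consistent with the symbols described above (normalized embeddings, a target-class weight vector ŵ_0, clip labels y_i, and per-class margins with the authentic margin larger than the synthetic one). The scale factor alpha, the margin values, and the helper name are assumptions.

```python
import torch
import torch.nn.functional as F

def one_class_margin_loss(embeddings, labels, w0, m_authentic=0.9, m_synthetic=0.2,
                          alpha=20.0):
    """Pull authentic clips (label 1) above margin m_authentic in cosine similarity
    to the target-class weight vector w0, and push synthetic clips (label 0) below
    m_synthetic, consolidating authentic speech and isolating synthetic speech."""
    x_hat = F.normalize(embeddings, dim=1)        # normalized embeddings
    w_hat = F.normalize(w0, dim=0)                # normalized target-class weight vector
    cos = x_hat @ w_hat                           # cosine similarity per clip
    authentic = labels == 1
    penalty = torch.where(authentic, m_authentic - cos, cos - m_synthetic)
    return F.softplus(alpha * penalty).mean()     # log(1 + exp(.)) averaged over the batch

# Example: w0 can be a learnable parameter updated jointly with the embedder.
# w0 = torch.nn.Parameter(torch.randn(256))
# loss = one_class_margin_loss(ArtifactEmbedder()(features), labels, w0)
```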
During the inference phase of machine learning system 210 determining whether an audio clip of input data 244 includes at least one frame of synthetic speech, speech artifact embeddings module 212 may provide the speech artifact embeddings generated by machine learning model 218 to scoring module 232. Scoring module 232 may include a probabilistic linear discriminant analysis (PLDA) back-end classifier. Scoring module 232 may leverage the PLDA classifier to provide better generalization across real-world data conditions. For example, in instances where interleaved audio is detected, scoring module 232 may apply the PLDA classifier for highly accurate interleaved aware score processing based on window-score smoothing.
Scoring module 232 may compute one or more scores for frames of the audio clip included in input data 244 based on the speech artifact embeddings. For example, scoring module 232 may apply PLDA to compute scores based on speech artifact embeddings and enrollment embeddings that represent speech artifact features associated with authentic speech. Scoring module 232 may reduce dimensions of speech artifact embeddings generated by speech artifact embeddings module 212 with a linear discriminant analysis (LDA) transformation and gaussianization of the input speech artifact embeddings. For example, scoring module 232 may process speech artifact embeddings according to the following equation:
w_i = μ + U_1x_1 + ε_i, where w_i represents the transformed speech artifact embedding, μ represents the mean vector, U_1 represents the eigen matrix, x_1 represents the hidden factor, and ε_i represents the residual variability. Scoring module 232 may provide the transformed speech artifact embeddings to segment score module 234.
Segment score module 234 may compute segment scores for each frame of the audio clip included in input data 244 based on the speech artifact embeddings. For example, segment score module 234 may determine a segment score as a likelihood synthetic speech was injected in a frame of the audio clip by comparing speech artifact embeddings transformed using LDA to enrollment vectors. Scoring module 232 may provide segment score module 234 enrollment embeddings that include vectors representing speech artifact features of authentic speech. Enrollment embeddings may include vectors representing features of authentic speech based on clips of authentic speech included in training data 222. Scoring module 232 may obtain enrollment embeddings from an administrator operating computing system 200 or may generate enrollment embeddings with a machine learning model trained to embed speech artifact features similar to speech artifact embeddings module 212. Segment score module 234 may compute a segment score (e.g., log-likelihood ratio, value between 0-1, etc.) for a frame of the audio clip included in input data 244 by comparing a corresponding speech artifact embedding (e.g., a corresponding transformed speech artifact embedding) to the enrollment vectors using PLDA. Segment score module 234 may determine temporal locations (e.g., frames) of the audio clip with synthetic speech based on the segment scores for each frame of the audio clip. For example, segment score module 234 may determine the 20 millisecond to 30 millisecond frame of the audio clip includes synthetic speech based on a corresponding segment score satisfying a threshold (e.g., the corresponding segment score is greater than 0.0). Segment score module 234 may output indications of the temporal locations that include synthetic speech as output data 248.
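A simplified stand-in for this scoring step is sketched below: a two-covariance, Gaussian log-likelihood-ratio comparison of each transformed frame embedding against an enrollment embedding for authentic speech, followed by thresholding at 0.0 as in the example above. The mean vector mu and covariance matrices B and W are assumed to come from PLDA training, and the sign convention (higher score indicating likely synthetic speech) follows the threshold example rather than a particular PLDA implementation.

```python
# Sketch: PLDA-style segment scoring against an authentic-speech enrollment embedding.
import numpy as np
from scipy.stats import multivariate_normal

def same_source_llr(test_emb, enroll_emb, mu, B, W):
    """Log-likelihood ratio that the two embeddings share a latent (authentic) class factor."""
    stacked = np.concatenate([enroll_emb, test_emb])
    mean = np.concatenate([mu, mu])
    tied = np.block([[B + W, B], [B, B + W]])                 # shared latent factor
    independent = np.block([[B + W, np.zeros_like(B)],
                            [np.zeros_like(B), B + W]])       # no shared factor
    return (multivariate_normal.logpdf(stacked, mean, tied)
            - multivariate_normal.logpdf(stacked, mean, independent))

def segment_scores(frame_embeddings, enroll_emb, mu, B, W):
    """Higher score = less like the authentic enrollment = more likely synthetic."""
    return np.array([-same_source_llr(e, enroll_emb, mu, B, W) for e in frame_embeddings])

def flag_synthetic(scores, timestamps, threshold=0.0):
    """Return timestamps of frames whose segment score exceeds the threshold."""
    return [t for t, s in zip(timestamps, scores) if s > threshold]
```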
In some instances, utterance score module 236 may compute an utterance level score representing whether the whole waveform of an audio clip includes synthetic speech. Utterance score module 236 may determine an utterance level score for an audio clip based on segment scores. Utterance score module 236 may obtain the segment scores from segment score module 234. Utterance score module 236 may determine an utterance level score by averaging all segment scores determined by segment score module 234. In some examples, utterance score module 236 may apply simple interleaved-aware score post-processing based on window-score smoothing to determine an utterance level score. For example, utterance score module 236 may smooth the segment scores output by segment score module 234 with a multiple-window mean filter of ten frames. Utterance score module 236 may average the scores of the top 5% of smoothed window scores to determine the utterance score for the entire waveform of the audio clip included in input data 244. Utterance score module 236 may determine whether the entire waveform of an input audio clip includes synthetic speech based on the utterance score. Utterance score module 236 may output an indication of whether the entire waveform of the input audio clip includes synthetic speech as output data 248.
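The window-smoothing and top-5% averaging described above can be sketched as follows; the handling of clips shorter than one smoothing window is an added assumption.

```python
# Sketch: interleaved-aware utterance scoring from per-frame segment scores.
import numpy as np

def utterance_score(segment_scores, window=10, top_fraction=0.05):
    """Smooth segment scores with a ten-frame mean filter, then average the top 5%
    of smoothed window scores to score the whole waveform."""
    scores = np.asarray(segment_scores, dtype=float)
    if len(scores) < window:
        return float(scores.mean())                 # clip shorter than one window
    kernel = np.ones(window) / window
    smoothed = np.convolve(scores, kernel, mode="valid")
    k = max(1, int(np.ceil(top_fraction * len(smoothed))))
    top = np.sort(smoothed)[-k:]                    # top 5% smoothed window scores
    return float(top.mean())

# Example: a mostly authentic clip (low scores) with a short injected synthetic
# region (high scores) still yields a high utterance-level score.
# utterance_score([-2.1] * 200 + [3.4] * 30 + [-1.8] * 200)
```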
In some instances, computing system 100 may output data for generating graphical user interface 354 to computing device 150. That is, graphical user interface 354 of
Additionally, or alternatively, scoring module 232, or more specifically utterance score module 236, may calculate an utterance level score or global score for audio clip 352. Utterance score module 236 may calculate the global score based on segment scores calculated for frames of audio clip 352. Utterance score module 236 may output the global score as global score 338. In the example of
Computing system 200 may process an audio clip to generate a plurality of speech artifact embeddings based on a plurality of synthetic speech artifact features (402). For example, computing system 200 may obtain audio 152 from a user device (e.g., computing device 150) or from a user operating computing system 200. Computing system 200 may generate, with machine learning system 210, a plurality of speech artifact embeddings for audio 152 based on a plurality of synthetic speech artifact features. For example, speech artifact embeddings module 212 of machine learning system 210 may process audio 152 to extract the plurality of synthetic speech artifact features from audio 152 that represent potential artifacts, distortions, degradations, or the like that have been left behind by a variety of synthetic speech generators. Machine learning model 218 of speech artifact embeddings module 212 may generate the plurality of speech artifact embeddings by processing the extracted synthetic speech artifact features to identify processed synthetic speech artifact features that correspond to relevant information of synthetic speech generation (e.g., processed synthetic speech artifact features that are within a boundary defined as features left behind by various synthetic speech generators).
Computing system 200 may compute one or more scores based on the plurality of speech artifact embeddings (404). For example, scoring module 232, or more specifically segment score module 234, may obtain the speech artifact embeddings from machine learning model 218 and generate a segment score for each speech artifact embedding by comparing the speech artifact embedding to an enrollment embedding representing speech artifact features of authentic speech. Segment score module 234 may apply PLDA to generate segment scores as a log-likelihood ratio that a frame of input audio 152 includes synthetic speech. Scoring module 232, or more specifically utterance score module 236, may additionally, or alternatively, generate an utterance level score representing whether the waveform of audio 152, as a whole, includes synthetic speech generated by various synthetic speech generators. Utterance score module 236 may, for example, generate an utterance level score for audio 152 by applying simple interleaved-aware score post-processing based on window-score smoothing to segment scores generated by segment score module 234.
Computing system 200 may determine, based on one or more scores, whether one or more frames of the audio clip include synthetic speech (406). Scoring module 232 may determine whether a frame of audio 152 includes synthetic speech based on a segment score generated by segment score module 234 satisfying a threshold. For example, scoring module 232 may determine a frame (e.g., the 20 millisecond to 30 millisecond frame of audio 152) includes synthetic speech based on a corresponding segment score satisfying (e.g., greater than or less than) a threshold segment score of 0.0. Scoring module 232 may output an indication of whether the one or more frames include synthetic speech (408). For example, scoring module 232 may output an indication as either a probability or Boolean value (e.g., “Yes” or “No”) associated with whether one or more frames of audio 152 include synthetic speech. Scoring module 232 may include, as part of the indication, particular time frames of audio 152 associated with the synthetic speech. Scoring module 232 may output the indication as output data 248. In some examples, scoring module 232 may output the indication via output devices 208.
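As a usage illustration, the hypothetical helpers sketched earlier (extract_features, prune_non_speech, ArtifactEmbedder, segment_scores, flag_synthetic, utterance_score) can be tied together roughly as follows; the chunking of frames into fixed-size segments and all trained parameters (the SAD model, embedder weights, enrollment embedding, and PLDA statistics) are assumptions.

```python
# Sketch: end-to-end flow corresponding loosely to steps (402)-(408).
import torch

def analyze_clip(waveform, sad_model, embedder, enroll_emb, mu, B, W,
                 chunk=100, threshold=0.0):
    feats, stamps = extract_features(waveform)                      # (402) per-frame features
    feats, stamps = prune_non_speech(feats, stamps, sad_model(feats))
    embeddings, starts = [], []
    with torch.no_grad():
        for i in range(0, len(feats) - chunk + 1, chunk):           # embed chunks of frames
            x = torch.from_numpy(feats[i:i + chunk]).float()[None, None]
            embeddings.append(embedder(x).squeeze(0).numpy())
            starts.append(stamps[i])
    scores = segment_scores(embeddings, enroll_emb, mu, B, W)       # (404) score embeddings
    return {
        "synthetic_segments": flag_synthetic(scores, starts, threshold),  # (406) flagged frames
        "utterance_score": utterance_score(scores),                       # (408) whole-clip score
    }
```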
The techniques described in this disclosure may be implemented, at least in part, in hardware, software, firmware or any combination thereof. For example, various aspects of the described techniques may be implemented within one or more processors, including one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or any other equivalent integrated or discrete logic circuitry, as well as any combinations of such components. The term “processor” or “processing circuitry” may generally refer to any of the foregoing logic circuitry, alone or in combination with other logic circuitry, or any other equivalent circuitry. A control unit comprising hardware may also perform one or more of the techniques of this disclosure.
Such hardware, software, and firmware may be implemented within the same device or within separate devices to support the various operations and functions described in this disclosure. In addition, any of the described units, modules or components may be implemented together or separately as discrete but interoperable logic devices. Depiction of different features as modules or units is intended to highlight different functional aspects and does not necessarily imply that such modules or units must be realized by separate hardware or software components. Rather, functionality associated with one or more modules or units may be performed by separate hardware or software components or integrated within common or separate hardware or software components.
The techniques described in this disclosure may also be embodied or encoded in computer-readable media, such as a computer-readable storage medium, containing instructions. Instructions embedded or encoded in one or more computer-readable storage mediums may cause a programmable processor, or other processor, to perform the method, e.g., when the instructions are executed. Computer readable storage media may include random access memory (RAM), read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), flash memory, a hard disk, a CD-ROM, a floppy disk, a cassette, magnetic media, optical media, or other computer readable media.
This application claims the benefit of U.S. Patent Application No. 63/465,740, filed May 11, 2023, which is incorporated by reference herein in its entirety.
This invention was made with government support under contract number HR001120C0124 awarded by DARPA. The government has certain rights in the invention.