The present disclosure generally pertains to the field of audio processing, in particular to methods and devices for audio analysis.
With the emergence of powerful deep neural networks (DNNs) and of the corresponding computer chips at low prices, the manipulation of image content, video content or audio content has become much easier and more widespread. A manipulation of image content, video content or audio content with DNNs (called “deepfakes”), and thus the creation of realistic video, image, and audio fakes, has become possible even for non-experts without much effort and without much background knowledge. For example, it has become possible to alter parts of a video, like for example the lip movement of a person, or to alter parts of an image, like for example the facial expression of a person, or to alter an audio file, like for example a speech of a person. This technique could be used for large-scale fraud or to spread realistic fake news in the political arena.
Therefore, it is desirable to improve the detection of audio content that has been manipulated by DNNs.
According to a first aspect, the disclosure provides a method comprising determining at least one audio event based on an audio waveform and determining a deepfake probability for the audio event.
According to a second aspect, the disclosure provides an electronic device comprising circuitry configured to determine at least one audio event based on an audio waveform and to determine a deepfake probability for the audio event.
Further aspects are set forth in the dependent claims, the following description and the drawings.
Embodiments are explained by way of example with respect to the accompanying drawings, in which:
The embodiments disclose a method comprising determining at least one audio event based on an audio waveform and determining a deepfake probability for the audio event.
An audio event may be any part of the audio waveform (or the complete audio waveform) and can be in the same format as the audio waveform or in any other audio format. An audio event can also be a spectrogram of any part of the audio waveform (or of the complete audio waveform), in which case it is denoted as an audio event spectrogram.
The audio waveform may be a vector of samples of an audio file. The audio waveform may be any kind of common audio waveform, for example a piece of music (i.e. a song), a speech of a person, or a sound like a gunshot or a car motor. The audio waveform may for example be stored as WAV, MP3, AAC, FLAC, WMA, etc.
According to the embodiments the deepfake probability may indicate a probability that the audio waveform has been altered and/or distorted by artificial intelligence techniques or has been completely generated by artificial intelligence techniques.
According to the embodiments the audio waveform may relate to media content such as an audio or video file or a live stream.
According to the embodiments the determining of the at least one audio event may comprise determining an audio event spectrogram of the audio waveform or of a part of the audio waveform.
According to the embodiments the method may further comprise determining the deepfake probability for an audio event with a trained DNN classifier.
The trained DNN classifier may output a probability that the audio event is a deepfake, which may also be indicated as fake probability value of the DNN classifier, and which may in this embodiment be equal to the deepfake probability of the audio event.
According to the embodiments determining at least one audio event may comprise performing audio source separation on the audio waveform to obtain a vocal or speech waveform, and wherein the deepfake probability is determined based on an audio event spectrogram of the vocal or speech waveform.
In another embodiment the audio source separation may separate, instead of the vocal waveform, another instrument (track) or another sound class of the audio waveform (e.g., environmental sounds such as the ambience of a café or of a car interior).
According to the embodiments determining at least one audio event may comprise determining one or more candidate spectrograms of the audio waveform or of a part of the audio waveform, labeling the candidate spectrograms by a trained DNN classifier, and filtering the labelled spectrograms according to their label to obtain the audio event spectrogram.
The trained DNN classifier may be trained to sort the input spectrograms into different classes. The process of linking a specific spectrogram with the class into which it was sorted by the trained DNN classifier may be referred to as labeling. The labeling may for example comprise storing a specific spectrogram together with its assigned class in a combined data structure. The labeling may for example also comprise storing a pointer from a specific spectrogram to its assigned class.
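As an illustration only, such a combined data structure could be sketched as follows in Python; the dictionary layout and the placeholder values are assumptions made for this sketch and are not part of the embodiments:

    import numpy as np

    # Minimal sketch of "labeling" as a combined data structure: a candidate
    # spectrogram is stored together with the class(es) assigned to it by the
    # trained DNN classifier. The values below are placeholders.
    candidate_spectrogram = np.zeros((96, 64))      # placeholder spectrogram
    labelled_spectrogram = {
        "spectrogram": candidate_spectrogram,
        "labels": ["human speech"],                 # class(es) assigned by the classifier
    }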
According to the embodiments determining the deepfake probability for the audio event may comprise determining an intrinsic dimension probability value of the audio event.
An intrinsic dimension probability value of an audio event may be a value which indicates the probability that the audio event is a deepfake and which is determined based on the intrinsic dimension of the audio event.
According to the embodiments the intrinsic dimension probability value may be based on a ratio of an intrinsic dimension of the audio event and a feature space dimension of the audio event, and on an intrinsic dimension probability function.
According to the embodiments determining the deepfake probability for the audio event spectrogram is based on determining a correlation probability value of the audio event spectrogram.
A correlation probability value of the audio event spectrogram may be a probability value which indicates the probability that the audio event spectrogram is a deepfake and which is determined based on a correlation value between the audio event spectrogram and a spectrogram which is known to be real (i.e. not a deepfake).
According to the embodiments the correlation probability value may be calculated based on a correlation probability function and a normalized cross-correlation between a resized stored real audio event spectrogram of a recording noise floor and noise-only parts of the audio event spectrogram.
According to the embodiments the method may further comprise determining a plurality of audio events based on the audio waveform, determining a plurality of deepfake probabilities for the plurality of audio events, and determining an overall deepfake probability of the audio waveform based on the plurality of deepfake probabilities.
According to the embodiments the method may further comprise determining a modified audio waveform by overlaying a warning message over the audio waveform based on the deepfake probability.
According to the embodiments the method may further comprise outputting a warning based on the deepfake probability.
The embodiments disclose an electronic device comprising circuitry configured to determine at least one audio event based on an audio waveform and to determine a deepfake probability for the audio event.
Circuitry may include a processor, a memory (RAM, ROM or the like), a GPU, a storage, input means (mouse, keyboard, camera, etc.), output means (a display (e.g. liquid crystal, (organic) light emitting diode, etc.), loudspeakers, etc.), a (wireless) interface, etc., as it is generally known for electronic devices (computers, smartphones, etc.). A DNN may for example be realized and trained on a GPU (graphics processing unit), which may increase the speed of deep-learning systems by about 100 times because GPUs are well-suited for the matrix/vector math involved in deep learning.
Embodiments are now described by reference to the drawings.
A deepfake is a media content, like a video or audio file or stream, which has been in parts altered and/or distorted by artificial intelligence techniques or which is completely generated by artificial intelligence techniques. Artificial intelligence techniques which are used to generate a deepfake comprise different machine learning methods like artificial neural networks, especially deep neural networks (DNNs). For example, an audio deepfake may be an audio file (like a song or a speech of a person) which has been altered and/or distorted by a DNN. The term deepfake may refer to the spectrogram (in this case also called deepfake spectrogram) of an audio file deepfake or it may refer to the audio file deepfake itself. The audio deepfake may for example be generated by applying audio-changing artificial intelligence techniques directly to an audio file, or by applying audio-changing artificial intelligence techniques to a spectrogram of an audio file and then generating the changed audio file by re-transforming the changed spectrogram back into audio format (for example by means of an inverse short time Fourier transform).
The audio pre-processing unit 101 pre-processes the complete audio waveform x∈ℝⁿ or parts of the audio waveform x∈ℝⁿ in order to detect and output multiple audio events x1, . . . , xK, with K∈ℕ. This pre-processing 101 may for example comprise applying a short time Fourier transform (STFT) to parts of or to the complete audio waveform x∈ℝⁿ, which yields audio events x1, . . . , xK in the form of audio event spectrograms, as described below in more detail with regard to
The audio events (or audio event spectrograms) x1, . . . , xK are forwarded to a deepfake detector 102, which determines deepfake probabilities Pdeepfake,1, . . . , Pdeepfake,K for the audio events (or audio event spectrograms) x1, . . . , xK, which indicate a respective probability for each of the audio events (or audio event spectrograms) x1, . . . , xK of being a (computer-generated) deepfake. Embodiments of a deepfake detector are described in more detail below with regard to
The overall deepfake probability Pdeepfake,overall of the audio waveform x∈ℝⁿ is output from the combination unit 103 and input into an information overlay unit 104. The information overlay unit 104 further receives the audio waveform x∈ℝⁿ as input and, if the overall deepfake probability Pdeepfake,overall of the audio waveform x∈ℝⁿ indicates that the audio waveform x∈ℝⁿ is a deepfake, the information overlay unit 104 adds (overlays) a warning message to the audio waveform x∈ℝⁿ, which yields a modified audio waveform x′∈ℝⁿ. The warning message of the modified audio waveform x′∈ℝⁿ can be played before or while the audio waveform x∈ℝⁿ is played to the listener to warn the listener that the audio waveform x∈ℝⁿ might be a deepfake. In another embodiment the audio waveform x∈ℝⁿ is directly played by the information overlay unit and, if the overall deepfake probability Pdeepfake,overall of the audio waveform x∈ℝⁿ is above a predetermined threshold, for example 0.5, a warning light at the smart loudspeaker system for audio deep fake detection 100 is turned on. In another embodiment the deep fake detector smart loudspeaker system 100 may constantly display a warning or trust level of the currently played part of the audio waveform x∈ℝⁿ on a screen display to the user, wherein the warning or trust level is based on the deepfake probabilities Pdeepfake,1, . . . , Pdeepfake,K and/or the overall deepfake probability Pdeepfake,overall of the audio waveform x∈ℝⁿ. The information overlay unit 104 is described in more detail below.
The smart loudspeaker system for audio deep fake detection 100 as shown in
In a first embodiment, the smart loudspeaker system for audio deepfake detection 100 may analyse the audio waveform x∈ℝⁿ in advance, i.e. before it is played out; in this case the audio waveform x∈ℝⁿ is a stored audio waveform x∈ℝⁿ. This can be described as an off-line operational mode. In another embodiment the smart loudspeaker system for audio deep fake detection 100 may verify an audio waveform x∈ℝⁿ while it is played out, which can be described as an on-line operational mode. In this case the pre-processing unit 101 receives the currently played part of an audio waveform x∈ℝⁿ as an input stream, which should be verified for authenticity. The audio pre-processing unit 101 may buffer the currently played parts of the audio waveform x∈ℝⁿ for a predetermined time span, for example 1 second or 5 seconds or 10 seconds, and then pre-process this buffered part x∈ℝⁿ of the audio stream.
The deepfake detection as described in the embodiment of
A music source separation 301 receives the audio waveform x∈ℝⁿ as input. In this embodiment the audio waveform x∈ℝⁿ is a piece of music. The music source separation separates the received audio waveform x∈ℝⁿ according to predetermined conditions. In this embodiment the predetermined condition is to separate a vocal track xv from the rest of the audio waveform x∈ℝⁿ. The music source separation unit 301 (which may also perform upmixing) is described in more detail in
The short-time Fourier transform STFT is a technique to represent the change in the frequency spectrum of a signal over time. While the Fourier transform as such does not provide information about the change of the spectrum over time, the STFT is also suitable for signals whose frequency characteristics change over time. To realize the short-time Fourier transform STFT, the time signal is divided into individual time segments with the help of a window function (w) and these individual time segments are Fourier transformed into individual spectral ranges.
The input into the STFT in this embodiment are each of the vocal track frames xv,1, . . . , xv,K, which are time-discrete entities. Therefore, a discrete-time short time Fourier transform STFT is applied. In the following the application of the STFT to the first vocal track frame xv,1 is described (l is the index to traverse the vector x). The STFT of the first vocal track frame xv,1, using the window function w[l−m], yields a complex valued function X(m, ω), i.e. the phase and magnitude, at every discrete time step m and frequency ω:

X(m, ω) = STFT{x[l]}(m, ω) = Σ_l x[l] w[l−m] e^(−jωl)
The window function w[l−m] is centred around the time step m and only has values unequal to 0 for a selected window length (typically between 25 ms and 1 second). A common window function is the rectangle function.
The squared magnitude |X(m, ω)|² of the discrete-time short time Fourier transform X(m, ω) yields the audio event spectrogram x1 of the first vocal track frame xv,1:

x1 := x1(m, ω) := |X(m, ω)|² = |STFT{x[l]}(m, ω)|²
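As an illustration only, the magnitude-squared STFT above could for example be computed as sketched below; the use of scipy and the sampling rate are assumptions made for this sketch and are not prescribed by the embodiments:

    import numpy as np
    from scipy.signal import stft

    fs = 16000                                    # assumed sampling rate in Hz
    x_v1 = np.random.randn(int(0.960 * fs))       # placeholder first vocal track frame
    # Discrete-time STFT with a 25 ms window; stft returns the frequencies, the
    # discrete time steps m and the complex-valued X(m, w).
    freqs, m, X = stft(x_v1, fs=fs, window="hann", nperseg=int(0.025 * fs))
    x_1 = np.abs(X) ** 2                          # audio event spectrogram |X(m, w)|^2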
The audio event spectrogram x1(m, ω) (in the following just denoted as x1) provides a scalar value for every discrete time step m and frequency ω and may be visually represented in a density plot as a grey-scale value. That means the audio event spectrogram x1 may be stored, processed and displayed as a grey scale image. An example of an audio spectrogram is given in
The STFT technique as described above may be applied to the complete vocal track xv or to the audio waveform x∈ℝⁿ.
The width of the window function w[m] determines the temporal resolution. It is important to note that, due to the Küpfmüller uncertainty relation, the resolution in the time domain and the resolution in the frequency domain cannot be chosen arbitrarily fine, but are bounded by the product of time and frequency resolution, which is constant. If the highest possible resolution in the time domain is required, for example to determine the point in time when a certain signal starts or stops, this results in a blurred resolution in the frequency domain. If a high resolution in the frequency domain is necessary to determine the frequency exactly, then this results in a blur in the time domain, i.e. the exact points in time can only be determined approximately.
The shift of the window determines the resolution of the x-axis of the resulting spectrogram. The y-axis of the spectrogram shows the frequency, which may be expressed in Hz or in the mel scale. The color of each point in the spectrogram indicates the amplitude of a particular frequency at a particular time.
In this case the parameters may be chosen according to the scientific paper “CNN architectures for large-scale audio classification”, by Hershey, Shawn, et al., published in the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017. That is, the vocal track xv is divided into frames with a length of 960 ms. The windows have a length of 25 ms and are applied every 10 ms. The resulting spectrogram is integrated into 64 mel-spaced frequency bins. This results in spectrograms with a resolution of 96×64 pixels. A vocal track xv with a length of 4 minutes 48 seconds yields 300 spectrograms, each with a resolution of 96×64 pixels.
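As an illustration only, this framing and these spectrogram parameters could be reproduced roughly as sketched below; the use of the librosa library and the 16 kHz sampling rate are assumptions of this sketch:

    import numpy as np
    import librosa

    def audio_event_spectrograms(x_v, sr=16000):
        """Split a (separated) vocal track into 960 ms frames and compute
        64-bin log-mel spectrograms (25 ms windows, 10 ms hop), roughly
        mirroring the parameters quoted above."""
        x_v = np.asarray(x_v, dtype=np.float32)  # waveform as float samples
        frame_len = int(0.960 * sr)              # 960 ms frame
        hop = int(0.010 * sr)                    # 10 ms hop
        win = int(0.025 * sr)                    # 25 ms analysis window
        spectrograms = []
        for start in range(0, len(x_v) - frame_len + 1, frame_len):
            frame = x_v[start:start + frame_len]
            mel = librosa.feature.melspectrogram(
                y=frame, sr=sr, n_fft=win, hop_length=hop,
                win_length=win, n_mels=64)
            spectrograms.append(np.log(mel + 1e-6).T)   # roughly 96 x 64 per frame
        return spectrograms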
In another embodiment the predetermined conditions for the music source separation may be to separate the audio waveform xϵn into melodic/harmonic tracks and percussion tracks, or in another embodiment the predetermined conditions for the music source separation may be to separate the audio waveform xϵn into all different instruments like drums, strings and piano etc.
In another embodiment more than one track or another separated track than the vocal track xv may be input into the STFT unit 302.
In yet another embodiment the audio event spectrograms, which are output by the STFT 302, may be further analysed by an audio event detection unit as described below in more detail at
As the separation of the audio source signal may be imperfect, for example due to the mixing of the audio sources, a residual signal 3 (r(n)) is generated in addition to the separated audio source signals 2a-2d. The residual signal may for example represent a difference between the input audio content and the sum of all separated audio source signals. The audio signal emitted by each audio source is represented in the input audio content 1 by its respective recorded sound waves. For input audio content having more than one audio channel, such as stereo or surround sound input audio content, spatial information for the audio sources is typically also included or represented by the input audio content, e.g. by the proportion of the audio source signal included in the different audio channels. The separation of the input audio content 1 into separated audio source signals 2a-2d and a residual 3 is performed on the basis of blind source separation or other techniques which are able to separate audio sources. The audio source separation may end here, and the separated sources may be output for further processing.
In another embodiment two or more separations may be mixed together again (e.g., if the network has separated the noisy speech into “dry speech” and “speech reverb”) in a second (upmixing) step. In this second step, the separations 2a-2d and the possible residual 3 are remixed and rendered to a new loudspeaker signal 4, here a signal comprising five channels 4a-4e, namely a 5.0 channel system. On the basis of the separated audio source signals and the residual signal, an output audio content is generated by mixing the separated audio source signals and the residual signal on the basis of spatial information. The output audio content is exemplarily illustrated and denoted with reference number 4 in
The STFT unit 302 receives the audio waveform x∈ℝⁿ as input. The STFT unit 302 divides the received audio waveform x∈ℝⁿ into L equal-length frames of a predetermined length. As described in the scientific paper quoted above, the STFT unit 302 divides the received audio waveform x∈ℝⁿ into frames with a length of 960 ms. The windows have a length of 25 ms and are applied every 10 ms. The resulting spectrogram is integrated into 64 mel-spaced frequency bins. This results in spectrograms with a resolution of 96×64 pixels. To these L frames a short time Fourier transform is applied, which yields candidate spectrograms s1, . . . , sL. The candidate spectrograms s1, . . . , sL are input into the trained DNN label-classifier 501. The trained DNN label-classifier 501 comprises a trained deep neural network, which is trained as described in the scientific paper quoted above. That is, the DNN is trained to label the input spectrograms in a supervised manner (i.e. using labelled spectrograms during the learning process), wherein 30871 labels from the “google knowledge graph” database are used, for example labels like “song”, “gunshot”, or “President Donald J. Trump”. In the operational mode the trained DNN label-classifier outputs the candidate spectrograms s1, . . . , sL, each provided with one or more labels (from the 30871 labels of the “google knowledge graph” database), which yields the set of labelled spectrograms s′1, . . . , s′L. The set of labelled spectrograms s′1, . . . , s′L is input into the label-based filtering 503, which only lets those spectrograms of the set s′1, . . . , s′L pass whose labels are part of a predetermined pass-set. The predetermined pass-set may for example include labels like “human speech” or “gunshot”, or “speech of President Donald J. Trump”. The subset of the K spectrograms of the set of labelled spectrograms s′1, . . . , s′L which are allowed to pass the label-based filtering 503 are defined as the audio event spectrograms x1, . . . , xK (wherein the labels may be removed or not).
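For illustration, the label-based filtering 503 could be sketched as follows; the classifier object and its predict_labels() method are hypothetical placeholders for the trained DNN label-classifier 501, and the pass-set shown is only an example:

    # Hypothetical sketch of the label-based filtering 503.
    PASS_SET = {"human speech", "gunshot"}          # example predetermined pass-set

    def filter_candidate_spectrograms(candidate_spectrograms, label_classifier,
                                      pass_set=PASS_SET):
        audio_event_spectrograms = []
        for s in candidate_spectrograms:
            labels = label_classifier.predict_labels(s)   # labels assigned by the DNN
            if any(label in pass_set for label in labels):
                audio_event_spectrograms.append(s)        # spectrogram passes the filter
        return audio_event_spectrograms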
In one embodiment the deepfake detector 102 comprises a trained deep neural network (DNN) classifier, for example a convolutional neural network (CNN), that is trained to detect audio deepfakes. In the case that the audio event spectrograms x1, . . . , xK as output by the pre-processing unit 101 are spectrograms, i.e. images (e.g. grayscale or two-channel), the deepfake detector can utilize neural network methods and techniques which were developed to detect video/image deepfakes.
In one embodiment the deepfake detector 102 comprises one of the several different methods of deepfake image detection which are described in the scientific paper “DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection”, by Tolosana, Ruben, et al. published in arXiv preprint arXiv:2001.00179 (2020).
In another embodiment the deepfake detector comprises a DNN classifier as described in the scientific paper “CNN-generated images are surprisingly easy to spot . . . for now”, by Wang, Sheng-Yu, et al. published in arXiv preprint arXiv:1912.11035 (2019). In this embodiment convolutional neural networks (CNNs) are used, which are a common architecture to implement DNNs for images. The training of the deepfake detector 102 for this embodiment is described in more detail in
The general architecture of a CNN for image classification is described below in
The value of each entry of the input image matrix 601 is the value of the colour of each pixel. For example, each entry of the input image matrix 601 might be a 24-bit value, wherein each of the colours red, green, and blue occupies 8 bits. A filter (also called kernel or feature detector) 602, which is a matrix with an uneven number of rows and columns (for example 3×3, 5×5, 7×7 etc.) and which may be symmetric or asymmetric (in audio applications it may be advantageous to use asymmetric kernels, as the audio waveform, and therefore also the spectrogram, may not be symmetric), is shifted from left to right and top to bottom such that the filter 602 is once centred over every pixel. At every shift the entries of the filter 602 are elementwise multiplied with the corresponding entries in the image matrix 601 and the results of all elementwise multiplications are summed up. The result of the summation generates the entry of a first layer matrix 603, which has the same dimension as the input image matrix 601. The position of the centre of the filter 602 in the input image matrix 601 is the same position where the generated result of the multiplication-summation as described above is placed in the first layer matrix 603. All rows of the first layer matrix 603 are placed next to each other to form a first layer vector 604. A nonlinearity (e.g., ReLU) may be placed between the first layer matrix 603 (convolutional layer) and the first layer vector 604 (affine layer). The first layer vector 604 is multiplied with a last layer matrix 605, which yields the result z. The last layer matrix 605 has as many rows as the first layer vector 604 has columns, and its S columns correspond to the S different classes into which the CNN should classify the input image matrix 601. For example, S=2, i.e. the image corresponding to the input image matrix 601 should be classified as either fake or real. The result z of the matrix multiplication between the first layer vector 604 and the last layer matrix 605 is input into a Softmax function. The Softmax function is defined as

σ(z)_i = e^(z_i) / Σ_(j=1..S) e^(z_j)
with i=1, . . . , S, which yields a probability distribution over the S classes, i.e. the probability for each of the S different classes into which the CNN should classify the input image matrix 601, which is in this case the probability Preal that the input image matrix 601 corresponds to a real image and the probability Pfake that the input image matrix 601 corresponds to a deepfake image. For binary classification problems, i.e. S=2, only one output neuron with a sigmoid nonlinearity may be used and if the output is below 0.5 the input may be labeled as class 1 and if it is above 0.5 the input may be labeled as class 2.
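As an illustration only, a small CNN of this general form could be sketched as follows; the layer sizes and kernel shapes are assumptions of this sketch and are not those of the Resnet 50 architecture mentioned below:

    import torch
    import torch.nn as nn

    # Illustrative CNN that maps a 1x96x64 spectrogram "image" to the two
    # classes real/fake (S = 2).
    class SpectrogramCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                          # -> 16 x 48 x 32
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                          # -> 32 x 24 x 16
            )
            self.classifier = nn.Linear(32 * 24 * 16, 2)  # "last layer matrix"

        def forward(self, x):                 # x: (batch, 1, 96, 64)
            z = self.classifier(self.features(x).flatten(1))
            return torch.softmax(z, dim=1)    # (P_real, P_fake) per input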
The entries of the filter 602 and the entries of the last layer matrix 605 are the weights of the CNN which are trained during the training process (see
The CNN can be trained in a supervised manner, by feeding an input image matrix, which is labelled as either corresponding to a real image or a fake image, into the CNN. The current output of the CNN, i.e. the probability of the image being real or fake, is input into a loss function and through a backpropagation algorithm the weights of the CNN are adapted.
The probability Pfake that an input image is classified as a deepfake by the trained classifier is also denoted as the fake probability value of a trained DNN classifier Pfake,DNN, i.e. Pfake,DNN=Pfake.
There exist several variants of the general CNN architecture described above. For example, multiple filters in one layer can be used and/or multiple layers can be used.
As described above, in one embodiment the deepfake detector uses the DNN classifier as described in the scientific paper “CNN-generated images are surprisingly easy to spot . . . for now”, by Wang, Sheng-Yu, et al. published in arXiv preprint arXiv:1912.11035 (2019). In this case the Resnet 50 CNN pretrained with ImageNet is used in a binary classification setting (i.e. the spectrogram is real or fake). The training process of this CNN is described in more detail in
Many deepfakes are generated with Generative Adversarial Networks (GANs). GANs consist of two artificial neural networks that perform a zero-sum game. One of them creates candidates (the generator), the second neural network evaluates the candidates (the discriminator). Typically, the generator maps from a vector of latent variables to the desired resulting space. The goal of the generator is to learn to produce results according to a certain distribution. The discriminator, on the other hand, is trained to distinguish the results of the generator from the data of the real, given distribution. The objective function of the generator is then to produce results that the discriminator cannot distinguish. In this way, the generated distribution should gradually adjust to the real distribution. There exist many different implementations and architectures of GANs.
As described in the above-quoted scientific paper, although the CNN in the deepfake detector 102 is only trained with deepfake spectrograms generated with one artificial intelligence technique, for example the GAN architecture ProGAN, it is able to detect deepfake spectrograms generated by several different models.
In another embodiment the CNN in the deepfake detector 102 may be trained with deepfakes which are generated with another model than ProGAN, or the CNN in the deepfake detector 102 may be trained with deepfakes which are generated with several different models.
In another embodiment the deepfake spectrograms of the large-scale database used for training of a DNN deepfake classifier may be generated by applying audio-changing artificial intelligence techniques directly to audio files and then transforming them by means of STFT into a deepfake spectrogram.
The error may be determined by calculating the deviation between the probability output by the Softmax function and the label of the image. For example, if the image was labelled “real” and the probability output of the Softmax function for being real is Preal and for being a deepfake is Pfake, then the error may be determined from the deviation of Preal from one (and of Pfake from zero), for example by a cross-entropy or squared-error loss.
Through backpropagation, for example with a gradient descent method, the weights are adapted based on the error. The probability Pfake that an input image is classified as a deepfake by the trained classifier is also denoted as the fake probability value of the trained DNN classifier Pfake,DNN, i.e. Pfake,DNN=Pfake.
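For illustration, one supervised training step of the sketched CNN could look as follows; the cross-entropy loss and the SGD optimizer are assumptions of this sketch, one of several common choices:

    import torch
    import torch.nn as nn

    model = SpectrogramCNN()                      # from the sketch above
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()               # operates on raw scores (logits)

    def train_step(spectrogram_batch, labels):    # labels: 0 = real, 1 = fake
        logits = model.classifier(model.features(spectrogram_batch).flatten(1))
        loss = loss_fn(logits, labels)            # error between prediction and label
        optimizer.zero_grad()
        loss.backward()                           # backpropagation of the error
        optimizer.step()                          # gradient descent weight update
        return loss.item()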
If more than one audio event spectrogram is input into the deepfake detector 102 comprising a trained DNN classifier the same process as described in
The problem of detecting a deepfake may be considered from a generator-discriminator perspective (GANs). That means that a generator tries to generate deepfakes and a discriminator, i.e. the deepfake detector 102 comprising a DNN classifier as described above, tries to identify the deepfakes. Therefore, it may happen that an even more powerful generator might eventually fool the discriminator (for example after being trained for enough epochs), i.e. the deepfake detector 102 comprising a DNN classifier as described above. Therefore, the deepfake detector 102 comprising a DNN classifier as described above might be extended by different deepfake detection methods.
Still further, in another embodiment, the deepfake detector 102 comprises, additionally to the DNN classifier as described above or instead of the DNN classifier as described above, an estimation of an intrinsic dimension of the audio waveform x∈ℝⁿ (see
Still further, in another embodiment, the deepfake detector 102 comprises, additionally to the DNN classifier as described above or instead of the DNN classifier as described above, a disparity discriminator (see
The intrinsic dimension (also called inherent dimensionality) of a data vector v (for example an audio waveform or an audio event) is the minimal number of latent variables needed to describe (represent) the data vector v (see details below).
This concept of the intrinsic dimension, with an even broader definition based on a manifold dimension where the intrinsic dimension only needs to exist locally, is also described in the textbook “Nonlinear Dimensionality Reduction” by Lee, John A., Verleysen, Michel, published in 2007.
Usually, real world datasets, for example a real-world image, have large numbers of (data) factors, often significantly greater than the number of latent factors underlying the data generating process. Therefore, the ratio between the number of features of a real dataset (for example a real spectrogram) and its intrinsic dimension can be significantly higher than the ratio between the number of features of a deepfake dataset (for example a deepfake spectrogram) and its intrinsic dimension.
The estimation of an intrinsic dimension of an image (for example a spectrogram) is described in the scientific paper “Dimension Estimation Using Autoencoders”, by Bahadur, Nitish, and Randy Paffenroth, published on arXiv preprint arXiv:1909.10702 (2019). In this scientific paper an autoencoder is trained to estimate the intrinsic dimension of an input image.
An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a (latent) representation (encoding) for a set of data by training the network to ignore signal “noise”. Along with the reduction side (encoder), a reconstructing side (decoder) is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. One variant of an autoencoder is a feedforward, non-recurrent neural network similar to the single layer perceptrons that participate in multilayer perceptrons (MLPs), having an input layer, an output layer and one or more hidden layers connecting them, where the output layer has the same number of nodes (neurons) as the input layer, and with the purpose of reconstructing its inputs (minimizing the difference between the input and the output) instead of predicting a target value Y given inputs X. Therefore, autoencoders are unsupervised learning models (they do not require labelled inputs to enable learning).
A latent space dimension is defined as the number of nodes in the hidden layer (latent space) in an autoencoder.
A feature space dimension is defined as the number of input nodes in the input layer in an encoder of an autoencoder, for example number of pixels of a spectrogram.
In the training mode, the autoencoder 900 is trained with different deepfake spectrograms and real spectrograms and learns a latent representation of the input deepfake spectrograms and real spectrograms. From this latent representation of the input spectrograms the intrinsic dimension of the input image can be estimated as described in the scientific paper “Dimension Estimation Using Autoencoders”, by Bahadur, Nitish, and Randy Paffenroth, published on arXiv preprint arXiv:1909.10702 (2019).
In operational mode the trained autoencoder 900 outputs an estimated intrinsic dimension dimint of an input spectrogram.
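For illustration, such an autoencoder could be sketched as follows; the hidden layer sizes and the latent space dimension of 32 are assumptions of this sketch, and the actual intrinsic dimension estimate would be derived from the learnt latent representation as described in the cited paper:

    import torch.nn as nn

    FEATURE_DIM = 96 * 64      # feature space dimension: pixels of a 96x64 spectrogram
    LATENT_DIM = 32            # assumed latent space dimension (hidden layer size)

    # Encoder and decoder mirror each other; training minimizes the reconstruction
    # error (e.g. nn.MSELoss()) between input and output, without any labels.
    autoencoder = nn.Sequential(
        nn.Linear(FEATURE_DIM, 512), nn.ReLU(),   # encoder
        nn.Linear(512, LATENT_DIM),               # latent representation
        nn.Linear(LATENT_DIM, 512), nn.ReLU(),    # decoder
        nn.Linear(512, FEATURE_DIM),
    )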
In step 1003, a ratio rdim = dimint/dimfeat of the intrinsic dimension dimint of the input audio event spectrogram x1 and the feature space dimension dimfeat of the input audio event spectrogram x1 is determined. In step 1004, an intrinsic dimension probability value Pintrinsic = ƒintrinsic(rdim) of the input audio event spectrogram x1 is determined based on the ratio rdim and an intrinsic dimension probability function ƒintrinsic. In step 1005, a deepfake probability Pdeepfake = Pintrinsic is determined as the intrinsic dimension probability value Pintrinsic.
The intrinsic dimension probability function ƒintrinsic may be a piecewise-defined function which maps the ratio rdim to a probability value between 0 and 1.
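As an illustration only, one possible shape of such a piecewise mapping is sketched below; the threshold values are assumptions, and the direction of the mapping (a higher ratio rdim yielding a higher deepfake probability) follows from the observation above that the feature-to-intrinsic-dimension ratio tends to be higher for real data:

    def f_intrinsic(r_dim, r_low=0.05, r_high=0.30):
        """Hypothetical piecewise mapping of the dimension ratio r_dim to a
        deepfake probability; r_low and r_high are assumed design parameters."""
        if r_dim <= r_low:       # ratio typical for real-world data
            return 0.0
        if r_dim >= r_high:      # ratio typical for generated data
            return 1.0
        return (r_dim - r_low) / (r_high - r_low)   # linear transition in between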
If more than one audio event spectrogram is input into the deepfake detector 102 comprising an intrinsic dimension estimator, the same process as described in
In step 1103, a ratio rdim = dimint/dimfeat of the intrinsic dimension dimint of the input audio event spectrogram x1 and the feature space dimension dimfeat of the input audio event spectrogram x1 is determined. In step 1104, an intrinsic dimension probability value Pintrinsic = ƒintrinsic(rdim) of the input audio event spectrogram x1 is determined based on the ratio rdim and an intrinsic dimension probability function ƒintrinsic. In step 1105, a fake probability value Pfake,DNN of a trained DNN classifier, indicating the probability of the input audio event spectrogram x1 being a deepfake, is determined, as described in
In another embodiment, deepfake probability Pdeepfake for the input audio event spectrogram x1 is determined as the maximum of the intrinsic dimension probability value Pintrinsic and the fake probability value Pfake,DNN of the trained DNN classifier: Pdeepfake=max{Pfake,DNN, Pintrinsic}.
If more than one audio event spectrogram is input into the deepfake detector 102 comprising a DNN deepfake classifier and an intrinsic dimension estimator, the same process as described in
The deepfake detector 102 can comprise a disparity discriminator. A disparity discriminator can discriminate a real audio event from a fake audio event by comparing pre-defined features or patterns of an input audio waveform (or an audio event) to the same pre-defined features or patterns of a stored real audio waveform. This works because it can be observed that there are disparities in certain properties between real audio events and deepfake audio events.
In one embodiment the disparity discriminator of the audio deepfake detector 102 can discriminate between a real audio event and a deepfake audio event by comparing (for example by a correlation, see
The correlation probability function ƒcorr may be defined as a function which maps the normalized cross-correlation value to a probability value, wherein a low correlation between the noise-only parts of the input audio event spectrogram and the stored real recording noise floor spectrogram indicates a high deepfake probability.
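A minimal sketch of such a disparity check is given below; the zero-lag normalized cross-correlation and the monotone mapping ƒcorr with its parameters are assumptions chosen for this sketch:

    import numpy as np

    def normalized_cross_correlation(a, b):
        """Zero-lag normalized cross-correlation between two equally sized
        spectrogram excerpts, returning a value in [-1, 1]."""
        a = (a - a.mean()) / (a.std() + 1e-12)
        b = (b - b.mean()) / (b.std() + 1e-12)
        return float((a * b).mean())

    def f_corr(c, c_min=0.2, c_max=0.8):
        """Hypothetical mapping of the correlation value c to a deepfake
        probability: a poor match with the stored real recording noise floor
        yields a high probability (c_min and c_max are assumed parameters)."""
        c = float(np.clip(c, c_min, c_max))
        return (c_max - c) / (c_max - c_min)

    # P_corr = f_corr(normalized_cross_correlation(noise_only_part, stored_noise_floor))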
In another embodiment the disparity discriminator of the audio deepfake detector 102 can compare the input audio event with more than one recording noise floor of more than one stored real audio event (e.g., for different recording studios). In this case, instead of the term corr(x̃1, y), a term based on the correlations corr(x̃1, yj) with the multiple stored recording noise floors yj, for example their maximum, is used.
In another embodiment the disparity discriminator of the audio deepfake detector 102 can discriminate between a real audio event and a deepfake audio event by comparing (for example by a correlation) (patterns of) a quantization noise floor (also called artefacts) of an input audio event to a quantization noise floor of a stored real audio event. That is because real vocal signals are recorded with an (analog) microphone and the conversion from an analog signal to a digital signal (A/D conversion) through a quantization process results in a quantization noise floor in the real vocal signal. This quantization noise floor has a specific pattern which can be detected, for example by comparing the quantization noise floor pattern of the input waveform to the quantization noise floor pattern of a stored real audio waveform, for example by applying a cross-correlation as explained above to the spectrogram of the input audio event and to a stored spectrogram of a real audio event which comprises a typical quantization noise floor. If the input audio event is a music piece, the vocal track of the input audio event can be separated from the rest of the music piece (see
Or in another embodiment an artificial neural network can be trained specifically to discriminate the disparities of the recording noise floor feature(s) and the quantization noise floor feature(s) between a real spectrogram and a deepfake spectrogram.
In yet another embodiment disparities for certain properties between real audio event spectrograms and deepfake audio event spectrograms may be visible in one or more differing features of a learned latent representation. A latent representation of a spectrogram of an audio waveform may be obtained by the use of an autoencoder, as described above in
Still further, in another embodiment, the deepfake detector 102 comprises, additionally to the DNN classifier as described above in
In another embodiment, the deepfake probability Pdeepfake for the input audio event spectrogram x1 is determined as the maximum of the correlation probability value Pcorr and the fake probability value Pfake,DNN of a trained DNN classifier: Pdeepfake=max{Pfake,DNN, Pcorr}.
If more than one audio event spectrogram is input into the deepfake detector 102 comprising a DNN deepfake classifier and a disparity discriminator, the same process as described in
Still further, in another embodiment, the deepfake detector 102 comprises, additionally to the DNN classifier as described above in
In another embodiment, the deepfake probability Pdeepfake for the input audio event spectrogram x1 is determined as the maximum of the correlation probability value Pcorr, the fake probability value Pfake,DNN and the intrinsic dimension probability value Pintrinsic: Pdeepfake=max{Pfake,DNN, Pcorr, Pintrinsic}.
If more than one audio event spectrogram is input into the deepfake detector 102 comprising a DNN deepfake classifier, a disparity discriminator and an intrinsic dimension estimator, the same process as described in
In the embodiment of
In one embodiment the combination unit combines them into an overall deepfake probability Pdeepfake,overall of the audio waveform x as Pdeepfake,overall=max{Pdeepfake,1, . . . , Pdeepfake,K}.
In another embodiment a refinement is taken into account by weighting the deepfake probabilities Pdeepfake,1, . . . , Pdeepfake,K for the respective audio events x1, . . . , xK with respective weights w1, . . . , wK>0. For example, audio events which contain speech may be weighted higher.
The overall deepfake probability Pdeepfake,overall of the audio waveform x is determined as Pdeepfake,overall = (Σ_(k=1..K) wk Pdeepfake,k) / (Σ_(k=1..K) wk).
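A minimal sketch of the two combination rules above could look as follows (the function name and argument layout are choices of this sketch):

    def combine_deepfake_probabilities(probabilities, weights=None):
        """Combine per-event deepfake probabilities into an overall value,
        either as their maximum or as a weighted average (weights > 0)."""
        if weights is None:
            return max(probabilities)                       # P_overall = max_k P_k
        return (sum(w * p for w, p in zip(weights, probabilities))
                / sum(weights))                             # weighted average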
The overall deepfake probability Pdeepfake,overall of the audio waveform x is output from the combination unit 103 and input into an information overlay unit 104.
The information overlay unit 104 receives a deepfake probability of an audio file and the audio file itself and generates a warning message which is overlaid over the audio file, which yields a modified audio file which is output by the deep fake detector smart loudspeaker system 100.
The information overlay unit 104 can computer-generate a warning message xwarning, which can have the same format as the audio waveform x∈ℝⁿ. The warning message xwarning can comprise a computer-generated speech message announcing the calculated deepfake probability Pdeepfake,overall of the audio waveform x or the deepfake probability Pdeepfake of the audio event x1. The warning message xwarning can instead or additionally comprise a computer-generated general warning speech message like “This audio clip is likely a deepfake.”. The warning message xwarning can instead or additionally comprise a computer-generated play-out specific warning message like “The following audio clip contains a computer-generated voice that sounds like President Donald J. Trump”, or “The following audio clip is a deepfake with an estimated probability of 75%”. The warning message xwarning can instead or additionally comprise a play-out warning melody.
In the embodiment of
In the embodiment of
If the audio waveform x∈ℝⁿ is received by the information overlay unit 104 in off-line mode, the warning message xwarning can be overlaid over the audio waveform x∈ℝⁿ by merging the warning message xwarning with the audio waveform x∈ℝⁿ at any given time of the audio waveform x∈ℝⁿ (i.e. before, during or after the audio waveform x∈ℝⁿ), which yields a modified audio waveform x′∈ℝⁿ. The warning message xwarning can be played with a higher amplitude than the audio waveform x∈ℝⁿ in the modified audio waveform x′∈ℝⁿ, for example with double amplitude. The audio waveform x∈ℝⁿ can also be cut at any given part and the warning message xwarning inserted, which yields the modified audio waveform x′∈ℝⁿ.
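As an illustration only, the off-line overlay could be sketched as follows; the sample-domain mixing, the gain of 2.0 and the function name are assumptions of this sketch:

    import numpy as np

    def overlay_warning(x, x_warning, position=0, gain=2.0):
        """Mix the warning message into the audio waveform at a given sample
        position, here with double amplitude, yielding the modified waveform."""
        x_mod = x.astype(np.float32)
        end = min(position + len(x_warning), len(x_mod))
        x_mod[position:end] += gain * x_warning[:end - position].astype(np.float32)
        return x_mod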
If the audio waveform x∈ℝⁿ is received by the information overlay unit 104 in on-line mode, the warning message xwarning can be overlaid over the currently played audio waveform x∈ℝⁿ by live-merging (i.e. the currently played audio waveform x∈ℝⁿ is buffered for a time period and merged with the warning message xwarning) the warning message xwarning with the currently played audio waveform x∈ℝⁿ. The warning message xwarning can be played with a higher amplitude than the audio waveform x∈ℝⁿ in the modified audio waveform x′∈ℝⁿ, for example with double amplitude. The currently played audio waveform x∈ℝⁿ can also be paused/cut and the warning message xwarning inserted, which yields the modified audio waveform x′∈ℝⁿ.
In another embodiment, the information overlay unit 104 may output a warning light (turning it on) while playing the audio waveform x∈ℝⁿ, if the overall deepfake probability Pdeepfake,overall of the audio waveform x∈ℝⁿ or the deepfake probability Pdeepfake of the audio event x1 is above a pre-determined threshold, for example 0.5.
In another embodiment a screen display may display the overall deepfake probability Pdeepfake,overall of the audio waveform x∈ℝⁿ or the deepfake probability Pdeepfake of the audio event x1.
In another embodiment a screen display may display a trust level of the audio waveform x∈ℝⁿ, which may be the inverse value of the deepfake probability Pdeepfake,overall of the audio waveform x∈ℝⁿ or of the deepfake probability Pdeepfake of the audio event x1.
In another embodiment the audio waveform x∈ℝⁿ may be muted completely if the deepfake probability Pdeepfake,overall of the audio waveform x∈ℝⁿ or the deepfake probability Pdeepfake of the audio event x1 exceeds a certain threshold, for example 0.5. In another embodiment parts of the audio waveform x∈ℝⁿ for which a deepfake probability Pdeepfake exceeds a certain threshold, for example 0.5, are muted. In another embodiment separated tracks of the audio waveform x∈ℝⁿ for which a deepfake probability Pdeepfake exceeds a certain threshold, for example 0.5, are muted.
Instead of implementing the detection pipeline directly on the chip/silicon level, it would also be possible to implement it as part of the operating system (video/audio driver) or as part of the internet browser. For example, the operating system or browser may constantly check the video/audio output of the system such that it can automatically detect possible deepfakes and warn the user accordingly.
It should be recognized that the embodiments describe methods with an exemplary ordering of method steps. The specific ordering of method steps is, however, given for illustrative purposes only and should not be construed as binding. For example, steps 1401, 1402 or 1403 in
It should also be noted that the division of the electronic device of
All units and entities described in this specification and claimed in the appended claims can, if not stated otherwise, be implemented as integrated circuit logic, for example, on a chip, and functionality provided by such units and entities can, if not stated otherwise, be implemented by software.
In so far as the embodiments of the disclosure described above are implemented, at least in part, using software-controlled data processing apparatus, it will be appreciated that a computer program providing such software control and a transmission, storage or other medium by which such a computer program is provided are envisaged as aspects of the present disclosure.
Note that the present technology can also be configured as described below:
Number | Date | Country | Kind |
---|---|---|---|
20189193.4 | Aug 2020 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2021/071478 | 7/30/2021 | WO |