Described herein are mechanisms for watermarking of speech signals.
Many systems and applications are speech enabled, allowing users to interact with the system via speech. Speech is sometimes used to authenticate users via voice biometrics, phrases, etc. However, with developments in text-to-speech (TTS) technologies, synthetic speech is becoming increasingly difficult to detect. In order to prevent unauthorized copying of speech signals or the use of synthetic speech signals, speech signals may be encoded with a watermark. Current watermarking techniques, however, may not ensure appropriate authentication of speech signals, or may degrade the quality of the audio signal.
A method for applying a watermark signal to a speech signal to prevent unauthorized use of speech signals, the method may include receiving an original speech signal; determining a corresponding spectrogram of the original speech signal; selecting a phase sequence of fixed frame length and uniform distribution; and generating an encoded watermark signal based on the corresponding spectrogram and phase sequence.
In a further embodiment, the method includes taking the magnitude of the original speech spectrogram to generate the encoded watermark.
In another embodiment, the spectrogram is determined by applying a short-time Fourier transform (STFT) to determine the sinusoidal frequency and phase content of each frame of the original input signal.
In a further embodiment, the method includes applying bit encoding prior to generating the encoded watermark.
In another embodiment, the bit encoding includes assigning bits based on information about the original speech signal.
In a further embodiment, the bit encoding is spread out through a subset of frequency bins to allow for detection of the bit encoding in adverse conditions.
In another embodiment, the method includes determining a frequency dependent gain factor based at least in part on a frequency of the original speech signal.
In a further embodiment, the frequency dependent gain factor is based on at least one frequency threshold, where a first gain factor is selected for frequencies below a first threshold frequency, and where a second gain factor is selected for frequencies above a second threshold frequency.
In another embodiment, a transition gain factor is selected for frequencies between the first threshold frequency and the second threshold frequency.
In a further embodiment, the method includes storing the encoded watermark for authenticating a future speech signal, the encoded watermark defining permissions for use of the future speech signal.
In another embodiment, the method includes adding at least one of a Pretty Good Privacy (PGP) certificate or a public key cryptography token to the watermark signal.
In a further embodiment, the watermark signal includes words spoken in the original speech signal, wherein each word is associated with a sequence position.
In another embodiment, the watermark signal includes a start and end time for each word as spoken in the original speech signal.
A non-transitory computer readable medium comprising instructions for applying a watermark signal to a speech signal to prevent unauthorized use of speech signals that, when executed by a processor, cause the processor to perform operations that may include receiving an original speech signal; determining a corresponding spectrogram of the original speech signal; selecting a phase sequence of fixed frame length and uniform distribution; and generating an encoded watermark signal based on the corresponding spectrogram and phase sequence.
In another embodiment, the processor is programmed to perform operations further comprising taking the magnitude of the spectrogram to generate the encoded watermark.
In a further embodiment, the spectrogram is determined by applying a short-time Fourier transform (STFT) to determine the sinusoidal frequency and phase content of each frame of the original input signal.
In another embodiment, the processor is programmed to perform operations further comprising applying bit encoding prior to generating the encoded watermark.
In a further embodiment, the bit encoding includes assigning bits based on information about the original speech signal.
A method for applying a watermark signal to an audio signal including speech content to prevent unauthorized use of the speech content, the method may include receiving an original audio signal having speech content; generating an encoded watermark signal based on the original audio signal, the encoded watermark signal defining allowed usage of the original audio signal; and transmitting an encoded audio signal including the original audio signal and watermark signal.
The embodiments of the present disclosure are pointed out with particularity in the appended claims. However, other features of the various embodiments will become more apparent and will be best understood by referring to the following detailed description in conjunction with the accompanying drawings in which:
With the increased quality of text-to-speech technology, voice avatars could be used to trick a voice-biometric based security mechanism, or to send messages in the name of someone else. In order to prevent this, speech signals can be encoded with a watermark that contains extra information, for instance, whether the speech originates from a real person or a cloned voice, the native language of the voice's speaker, gender, and so forth. The watermark is mostly inaudible so that the speech quality is not reduced.
On the receiving side, a decoder may detect the watermark and read out the information within the watermark. The decoder may, for example, be used for authenticating the voice in a speech signal for voice biometrics or messaging and communication applications. The watermark may be a pseudo-random watermark sequence added to the speech signal in the frequency domain. Its magnitude may be controlled by the magnitude of the speech signal. Because of this, the watermark is concentrated at those locations in the spectrum where a modification of the speech signal would likely be audible. This allows the watermark system to thwart attacks such as adding noise to the signal or encoding the signal with a lossy audio codec.
Further, adding the watermark in the frequency domain also allows for sending different parts of the information contained in the watermark in different frequency bands, or duplicating the watermark's information across multiple frequency bands to make it harder to tamper with the watermark.
Splicing attacks may be attempted when an unauthorized user cuts certain words or phrases from a speech signal and rearranges the splices to create a new audio message out of the various clips. The watermark may contain the words of the audio message in text form, in their order in the utterance. For each word token in this string, the watermark may furthermore contain information about the sentence position where each word was spoken, as a token number and/or by indicating start and end times for each word in the sentence. Because the watermark is still present in each clip, any rearrangement can be detected, preventing splicing attacks. Additionally or alternatively, a counter may be added to the encoded information that regularly increases in a given time interval to further make copying or splicing detectable.
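As a concrete illustration, the anti-splicing information might be organized as follows. This is a minimal sketch in Python; the field names and layout are illustrative assumptions, not a format specified by this disclosure.

```python
# Illustrative sketch of an anti-splicing watermark payload: each word token
# carries its position in the utterance and its start/end times, and a counter
# increments at a fixed interval. All field names are assumptions.
splice_resistant_payload = {
    "words": [
        {"token": 0, "text": "protect",  "start_s": 0.00, "end_s": 0.42},
        {"token": 1, "text": "the",      "start_s": 0.42, "end_s": 0.55},
        {"token": 2, "text": "corridor", "start_s": 0.55, "end_s": 1.10},
    ],
    "counter_interval_s": 1.0,  # the counter increases once per interval
    "counter": 7,               # current counter value at encoding time
}
```

A spliced clip would carry word tokens whose positions, timestamps, or counter values no longer match the assembled audio, making the manipulation detectable.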
The watermark may include information about the speaker ID, speaking situation, allowed usage, and/or an authentication certificate or token, such as Pretty Good Privacy (PGP), public key cryptography, etc. The certification process may thus work in two parts: the voice signal authentication token may only be used by an authorized identity to create a certified voice sample, and people who have been given access to receive and listen to the voice signal may authenticate it via the (possibly encrypted) certificate that is part of the watermark and an additional security token such as a public key.
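This two-part flow can be sketched with generic public key signatures. The snippet below uses RSA-PSS from the Python `cryptography` package purely as one plausible instantiation; the disclosure names PGP and public key cryptography generically and does not mandate a particular algorithm or library.

```python
# Hedged sketch of the two-part certification flow using RSA-PSS signatures.
# The specific algorithm and library are assumptions; the disclosure only
# refers to PGP / public key cryptography in general terms.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Part 1: only the authorized identity holds the private key and can
# certify a voice sample by signing the watermark payload.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
payload = b"speaker=alice;type=clone;usage=messages-only"  # illustrative
signature = private_key.sign(
    payload,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Part 2: an authorized listener verifies the certificate embedded in the
# watermark against the speaker's public key; verify() raises on mismatch.
private_key.public_key().verify(
    signature,
    payload,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)
```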
The voice usage certificate or watermark may contain information about the allowed use of the voice. For example, the voice owner may specify that the voice may only be used for reading out messages that he sends, but not as a voice for a generic voice assistant. The watermark may also specify whether the speaker's artificial voice may be used to read out profanity or not and have an explicit list of blacklisted words that may not be spoken by the voice.
In another specific example of the need to watermark signals, a world leader may present a speech instructing the military to protect a refugee corridor. The world leader may add a watermark to the audio and/or video to authorize this audio stream or recording. When a receiver, which may be a private viewer, government official, foreign statesperson, military officer, or news agency, receives the content, they may run the authentication process to confirm that the audio is legitimate. On the other hand, if a propaganda operation produces a fake recording of the leader's voice saying he does not really care and just wants to play golf, that recording will not carry the authentication token and therefore cannot be assumed to be real.
Accordingly, a watermarking system is described herein with the ability to be inaudible for speech signals, while also being robustly secure against various avenues of attack.
The watermarking system 100 may be described herein as being specific to human speech signals, but may generally be applied to other types of audio signals, such as music, singing, etc. In some examples, the watermarking system 100 may be applicable within vehicles, as well as other systems, to verify speech signals prior to granting access to or generating TTS voice signals. In other examples, the system 100 may be applied to video content as well.
The watermarking system 100 may include a processor 106. The processor 106 may execute instructions for certain applications, including a watermark application 116. Instructions for the watermark application 116 may be maintained in a non-volatile manner using a variety of types of computer-readable storage medium 104. The computer-readable storage medium 104 (also referred to herein as memory 104, or storage) includes any non-transitory medium (e.g., a tangible medium) that participates in providing instructions or other data that may be read by the processor 106. Computer-executable instructions may be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies, including, without limitation, and either alone or in combination, Java, C, C++, C#, Objective C, Fortran, Pascal, JavaScript, Python, Perl, and PL/structured query language (SQL).
The watermarking system 100 may include a speech generator 108. The speech generator 108 may generate synthetic speech signals such as voice avatars based on previously acquired human speech signals. The speech generator 108 may use TTS systems, as well as other types of speech generators. The speech generator 108 may use voice transformation techniques, including spectral mapping to match certain target voices.
The watermarking system may include at least one microphone 112 configured to receive audio signals from a user, such as acoustic utterances including spoken words, phrases, passwords, or commands from a user. In the example where the system is within a vehicle, the microphone 112 may be used for other vehicle features such as active noise cancelation, hands-free interfaces, wake up word detection, etc. The microphone 112 may facilitate speech recognition from audio received via the microphone 112 according to grammar associated with available commands, and voice prompt generation. The microphone 112 may, in some implementations, include a plurality of sound capture elements, e.g., to facilitate beamforming or other directional techniques.
A user input mechanism 110 may be included, whereby a voice owner or user may utilize the user input mechanism 110 to enter preferences associated with the watermarking system 100. An authenticated user may be an individual who is permitted to use the voice of the voice owner to read out messages, or one who is permitted to receive the voice message, etc. The voice owner or user may be the originator (i.e., the person speaking in the recording or the person whose voice clone was created). That is, the voice owner or user may have the ability to enter allowed usage of the user's voice. For example, the user may allow the voice to be used for reading out messages, but not as a voice for a generic voice assistant, or for biometric authentication. Other settings may include allowing the voice to read out profanity, or adding blacklisted words to a list of words that may not be spoken. These user preferences may be used to generate the watermark, as described in more detail herein. Further, in some examples, the watermark may contain the words of the audio message in text form, in their order in the utterance. For each word token in this string, the watermark may furthermore contain information about the sentence position where each word was spoken, as a token number and/or by indicating start and end times for each word in the sentence.
The user input mechanism 110 may include a visual interface, such as a display on a user mobile device, computer, vehicle display, etc. The user input mechanism 110 may facilitate user input via a specific application that provides a user friendly interface allowing for selectable options, or customizable features. The user input mechanism 110 may also include an audio interface, such as a microphone capable of audibly receiving commands related to permissions and preferences for voice usage.
The watermark application 116 is configured to receive speech signal information or data from the memory 104, processor 106, speech generator 108, user input mechanism 110 and/or microphone 112 and generate a watermark to be added to a speech signal. The speech signal may be provided by the speech generator 108 or the microphones 112. The watermark application 116 is configured to generate and embed an audio watermark signal into the speech signal and output an output signal. The output signal may include the speech signal and the watermark, though the watermark is imperceptible to the human ear and does not degrade the speech signal. Moreover, it is designed such that it cannot be removed easily from the speech signal without destroying or at least seriously degrading it, such that use of the voice for unauthorized purposes can be detected or prevented by not allowing playback by the audio hardware/software. The output signal may be transmitted via a speaker (not shown), or may be recorded or saved for later use.
The watermark application may generate and maintain a watermark certificate 118 associated with the speech signal. The certificate 118 may be (or may otherwise include) the generated watermark. The watermark certificate 118 may be maintained separate from the output signal into which the watermark is embedded and may be used by a third party to determine whether a speech signal is authorized or not. That is, a recipient that is in possession of the certificate 118 may utilize the certificate 118 to determine whether a speech signal is genuine or unaltered, or whether it has been copied, reproduced, spliced, etc. In an example, the recipient may compare a digital footprint of the speech signal with the watermark certificate 118. Only authorized third parties may receive the certificate 118.
The certificate 118 may be generated based on the speech signal, including the magnitude of the speech signal, phase information, gain factors, user preferences, etc. That is, the certificate, or watermark, may be specific to each speech signal. This may allow for a higher degree of security as well as a better speech signal audio that is undisturbed by the addition of the watermark.
The watermark application 116, via the processor 106, or other specific processor, may transmit the certificate to a third party decoder 122. This may be achieved via a communication network 120. The communication network 120 may be referred to as a “cloud” and may involve data transfer via wide area and/or local area networks, such as the Internet, cellular networks, Wi-Fi, Bluetooth, etc. The communication network 120 may provide for communication between the watermark application 116 and the third party decoder 122. Further, the communication network 120 may also be a storage mechanism or database, in addition to the cloud, hard drives, flash memory, etc. The third party decoder 122 may be implemented on a remote server or otherwise external to the watermark application 116. While one decoder 122 is illustrated, more or fewer decoders 122 may be included, and the user may decide to send the certificate 118 to more than one third party, allowing more than one third party to authenticate speech signals based on the watermark. The third parties may also receive the watermark certificate 118 and decode the certificate 118 to denote user preferences for the use of the user's speech signal.
The watermarking system 100, including the processor 106, watermark application 116, decoder 122, as well as other components, may include one or more computer hardware processors coupled to one or more computer storage devices for performing steps of one or more methods as described herein, and may enable the watermark application 116 to communicate and exchange information and data with systems and subsystems external to the application 116 and local to or onboard the vehicle. The system 100 may include one or more processors 106 configured to perform certain instructions, commands, and other routines as described herein.
As explained, while automotive systems may be discussed in detail here, other applications may be appreciated. For example, similar functionality may also be applied to other, non-automotive cases. In one example, the functionality may be used for the verification of speech input to a smart speaker device. In another example, the functionality may be used for input to a smartphone. In yet another example, the functionality may be used for verification of speech input to a security system.
The watermark application 116 may determine a phase sequence θ(m,w), where m = 1, . . . , T. The phase sequence θ(m,w) is a multi-frame random sequence of fixed frame length T with uniform distribution in [0, 2π]. This sequence is chosen once by the watermark application and kept secret. The sequence may be randomly selected from a library of possible sequences, or may be randomly generated for each watermark.
The watermark application 116 may generate the watermark spectrum X̂(n,w), n = 1, 2, 3, . . . , obtained from the magnitude of the corresponding spectrogram X(n,w) of the original speech signal and the phase sequence θ(m,w), according to:

X̂(n,w) = |X(n,w)| · exp(iθ(mod(n,T), w)),

where mod is the modulus operator, i.e., the remainder of dividing n by T.
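This construction can be sketched in a few lines of Python. The STFT parameters, sampling rate, and function name below are illustrative assumptions; the disclosure does not fix them.

```python
# Minimal sketch of X_hat(n, w) = |X(n, w)| * exp(i * theta(mod(n, T), w)).
# The 16 kHz rate, 512-sample frames, and helper name are assumptions.
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng()

def make_watermark_spectrum(x, T=8, nperseg=512):
    # Spectrogram X(n, w) of the original speech signal x(t).
    _, _, X = stft(x, fs=fs, nperseg=nperseg)      # shape: (freqs, frames)
    n_freq, n_frames = X.shape
    # Secret phase sequence theta(m, w), m = 1..T, uniform in [0, 2*pi);
    # chosen once and kept secret by the watermark application.
    theta = rng.uniform(0.0, 2.0 * np.pi, size=(T, n_freq))
    # Speech magnitude combined with the secret phase, frame index mod T.
    frames = np.arange(n_frames)
    X_hat = np.abs(X) * np.exp(1j * theta[frames % T].T)
    return X_hat, theta
```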
For a high robustness watermark, the magnitude of the watermark spectrum should be as high as possible, but should also stay below the level where it becomes audible. Thus, a lower watermark magnitude may be used in lower frequencies of the original speech signal where the human hearing system is more sensitive to phase distortions.
While not expressly shown, it should be noted that the watermark may contain an additional authentication certificate or token, such as Pretty Good Privacy (PGP), public key cryptography, etc.
The watermarked output spectrum Y(n,w) may be obtained as:

Y(n,w) = X(n,w) + a(w) · X̂(n,w),

where a(w) may be a curve that is 0.1 (corresponding to an attenuation of −20 dB) for frequencies below 1000 Hz and 0.5 (corresponding to an attenuation of approximately −6 dB) for frequencies above 3000 Hz, with a transition in the dB scale in between.
That is, the gain factor may vary based on the frequency, where a first gain factor may be used for frequencies below a first threshold frequency, and where a second gain factor may be used for frequencies above a second threshold frequency. A transition gain factor may be used for frequencies between the first threshold frequency and the second threshold frequency.
Thus, the frequency dependent gain factor a(w) may be used to generate the watermark signal, creating a watermark spectrum that is as high as possible while still staying below the audible level.
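One plausible realization of such a gain curve, interpolating linearly on the dB scale between the two corner frequencies, is sketched below; the exact transition shape is an assumption, since the disclosure only requires a transition in the dB scale.

```python
# Sketch of the frequency-dependent gain a(w): 0.1 (-20 dB) below 1 kHz and
# 0.5 (about -6 dB) above 3 kHz, interpolated on the dB scale in between.
import numpy as np

def gain_curve(freqs_hz, lo=1000.0, hi=3000.0, a_lo=0.1, a_hi=0.5):
    lo_db, hi_db = 20 * np.log10(a_lo), 20 * np.log10(a_hi)
    # Fraction of the way through the transition band, clipped to [0, 1].
    t = np.clip((freqs_hz - lo) / (hi - lo), 0.0, 1.0)
    a_db = lo_db + t * (hi_db - lo_db)  # linear interpolation in dB
    return 10 ** (a_db / 20)
```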
Each bit b may be encoded by shifting the watermark phase by π for b = 1 and using the original watermark phase for b = 0. That is, the bits are represented and detected via phase shifting and, if needed, translated into the bit assignments for decoding. For example:

X̂(n,w,b) = |X(n,w)| · exp(i(θ(mod(n,T), w) + b·π))
This bit encoding may allow for cryptographic enhancement to be integrated, for example, by scrambling bits or by scrambling the frequency assignment as described below. Scrambling in this context could include choosing different frequency permutations for each entire encoding run, for each frame, or for a fixed number of frames.
The above bit assignments may be generalized by not just considering phase shifts of 0 and π, but by quantizing the phase shift to, e.g., multiples of π/4, so that eight values instead of two are encoded per frequency w (i.e., 3 bits instead of 1). This resembles a modulation technique called phase shift keying (PSK); the equation shown above for encoding one bit is related to binary PSK.
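The binary case and its PSK generalization can be sketched as follows; the function names are illustrative.

```python
# Sketch of bit encoding by phase shifting. encode_bit applies binary PSK;
# encode_symbol generalizes to 2**k phase values (pi/4 steps give 3 bits).
import numpy as np

def encode_bit(X_hat, b):
    # Binary PSK: shift the watermark phase by pi for b = 1, leave it for b = 0.
    return X_hat * np.exp(1j * b * np.pi)

def encode_symbol(X_hat, symbol, bits_per_symbol=3):
    # Generalized PSK: quantize the phase shift to multiples of
    # 2*pi / 2**bits_per_symbol (pi/4 when bits_per_symbol = 3).
    step = 2.0 * np.pi / (2 ** bits_per_symbol)
    return X_hat * np.exp(1j * symbol * step)
```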
Frequencies may be grouped into separate frequency subsets Ω1, Ω2, Ω3, Ω4, each associated with a respective bit b, e.g., b1 is encoded into the frequencies contained in Ω1, b2 is encoded into the frequencies contained in Ω2, and so on, as sketched below. This may allow for more robust bit detectability during decoding, while allowing several bits b = (b1, b2, b3, b4) to be encoded into one frame.
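A minimal sketch of this grouping follows; the interleaved subset assignment is an assumption, as the disclosure only requires disjoint subsets per bit.

```python
# Sketch of spreading bits b1..b4 across disjoint frequency subsets
# Omega_1..Omega_4. The interleaved bin assignment is an assumption.
import numpy as np

def encode_bits_in_subsets(X_hat, bits):
    n_freq = X_hat.shape[0]
    out = X_hat.copy()
    for k, b in enumerate(bits):                   # bits = (b1, b2, b3, b4)
        omega_k = np.arange(k, n_freq, len(bits))  # every len(bits)-th bin
        out[omega_k, :] *= np.exp(1j * b * np.pi)  # phase-shift those bins
    return out
```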
As explained above, audio signals may be used for voice biometric authentication, for repeating or reading out messages in a certain voice, etc. Such authentication and watermarking may be appreciated by public figures who speak in public often and are often recorded. Such watermarking may prevent the unauthorized copying, splicing, etc., of their respective voices.
In some examples, the watermark application 116 may transmit the certificate to the decoder 122 in parallel with generating the encoded watermark signal and output signal. In another example, the decoder 122 may request access to the certificate, and the watermark application 116 may transmit the certificate upon recognizing the decoder 122. In some instances, parts of the watermark signal may still remain secret to the decoder 122 or third parties.
At block 710, the watermark application 116 may determine a corresponding spectrogram X(n,w) based on the original speech signal x(t).
At block 715, the watermark application 116 may select the phase sequence θ(m,w). Notably, the phase sequence may be kept as a secret.
At block 720, the watermark application 116 may determine the frequency-dependent gain factor a(w), where a(w) may be a curve that is 0.1 (corresponding to an attenuation of −20 dB) for frequencies w < 1000 Hz and 0.5 (corresponding to an attenuation of approximately −6 dB) for frequencies w > 3000 Hz, with a transition in the attenuations therebetween.
At block 725, the watermark application 116 may apply bit encoding to indicate various properties about the speech signal, including voice type and voice name, for example. The bit encoding may be spread out over a subset of frequency bins to allow detection in adverse conditions. The bit encoding may be achieved by shifting the watermark phase by π for b = 1 and using the original watermark phase for b = 0, as in the bit-encoding equation above.
At block 730, the watermark application 116 may generate the encoded watermark signal X̂(n,w,b) based on at least a subset of the spectrogram X(n,w), phase sequence θ(m,w), gain factor a(w), and bit encoding. In one example, the watermark application may take the magnitude of the spectrogram X(n,w) to generate the watermark signal. For example:
X̂(n,w) = |X(n,w)| · exp(iθ(mod(n,T), w))
In another example, as explained in block 725, bit encoding may also be used to generate the watermark signal X̂(n,w,b).
At block 735, the watermark application 116 may generate the output signal by applying the encoded watermark signal X̂(n,w,b) to the spectrogram X(n,w) of the original speech signal:
Y(n,w) = X(n,w) + a(w) · X̂(n,w,b)
The process 700 may then end.
The process 700 may be carried out by the processor 106 or another processor specific to or shared with the watermark application 116. The watermark signal may be generated based on one or more factors and signals, and may omit one or more of the bit encoding, gain factor, phase sequence, etc., as discussed above.
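Putting the blocks together, process 700 can be sketched end to end in Python; this composes the illustrative helpers defined in the earlier sketches, and the STFT parameters remain assumptions.

```python
# End-to-end sketch of process 700 (blocks 710-735), composing the helper
# functions sketched earlier. Parameters are illustrative assumptions.
import numpy as np
from scipy.signal import stft, istft

def embed_watermark(x, bits, fs=16000, nperseg=512, T=8):
    freqs, _, X = stft(x, fs=fs, nperseg=nperseg)          # block 710
    X_hat, theta = make_watermark_spectrum(x, T, nperseg)  # blocks 715, 730
    X_hat = encode_bits_in_subsets(X_hat, bits)            # block 725
    a = gain_curve(freqs)[:, None]                         # block 720
    Y = X + a * X_hat                                      # block 735
    _, y = istft(Y, fs=fs, nperseg=nperseg)                # output signal
    return y, theta                                        # theta stays secret
```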
At block 810, the decoder 122 may receive the certificate or watermark signal. At block 815, the decoder may compare the audio signal with the certificate.
At block 820, the decoder 122 may determine whether the audio signal includes the encoded watermark signal. This may be done by comparing the certificate 118 with the audio signal to see if the audio signal includes the certificate. If the decoder 122 determines that the encoded watermark signal is present in the audio signal, the process 800 proceeds to block 825. If not, the process 800 proceeds to block 830.
At block 825, the decoder 122 may authorize access to authenticate the audio signal based on the presence of the watermark signal. This may allow the audio signal to be transmitted, played, etc.
At block 830, in the absence of a watermark signal or in the case of unauthorized use of a watermarked voice signal, the decoder 122 may deny access or authentication and may transmit messages or instructions indicating the unauthorized use of the audio signal.
The process 800 may then end.
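The disclosure does not spell out the decoder's detection math; one plausible sketch, under the assumption that the decoder holds the secret phase sequence θ, is to correlate the received spectrum against the expected watermark phase and threshold the normalized score. This ignores bit encoding (phase shifts of π would reduce the score) and is only a starting point.

```python
# Hedged, assumed detection sketch: project the received spectrum onto the
# expected secret-phase watermark; a high normalized score suggests the
# watermark is present. Bit encoding is deliberately ignored here.
import numpy as np
from scipy.signal import stft

def detect_watermark(y, theta, fs=16000, nperseg=512, threshold=0.1):
    _, _, Y = stft(y, fs=fs, nperseg=nperseg)
    T = theta.shape[0]
    n_frames = Y.shape[1]
    # Expected unit-magnitude watermark phase for each frame (index mod T).
    expected = np.exp(1j * theta[np.arange(n_frames) % T].T)
    # Normalized correlation between the received spectrum and the phase.
    score = np.abs(np.vdot(expected, Y))
    score /= np.linalg.norm(Y) * np.linalg.norm(expected)
    return score > threshold, score
```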
While the methods refer to audio signals, it is to be understood that other content and signals may benefit from the watermarking system 100 and the processes described herein. For example, the processes may be applied to pictorial signals such as video signals to protect against fake videos. The watermark may be applied to the image data within a video stream, though the audio content of the video may also benefit from watermarking at the same time. Further, in the example of a synthetic voice recording or human speech, the receiver may receive the message, e.g., a TTS voice sample, a cloned voice, a human voice recording, a video, etc. The watermark may be used to verify that such a recording is authentic or validated. In this example, the decoder 122 may determine whether the audio signal includes a watermark and, if so, may extract the watermark. The decoder may then validate the watermark. This may be done in one of several ways. First, the system may present the content of the watermark to the user (e.g., type of audio: human recording, cloned voice, etc.; word sequence that the audio should produce; identity of the speaker; date of the recording; certificate/encrypted token; etc.). The user may then determine whether this watermark is valid.
Second, the decoder may determine whether the certificate and/or tokens of the sender are valid/match. Third, automatic speech recognition may be used to automatically check whether the spoken words in the audio file match the word sequence that is part of the watermark.
The descriptions of the various embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments.
Aspects of the present embodiments may be embodied as a system, method or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “module” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or Flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Aspects of the present disclosure are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such processors may be, without limitation, general purpose processors, special-purpose processors, application-specific processors, or field-programmable gate arrays (FPGAs).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Additionally, the features of various implementing embodiments may be combined to form further embodiments of the invention.