Generative models can perform a wide array of tasks, including speech synthesis (text-to-speech or “TTS”), music generation, image generation, video generation, sound enhancement, image enhancement, video enhancement, text rendering, handwriting rendering, virtual avatar generation, and more. As the use of such models becomes more widespread, it may be desirable to embed metadata within the synthetic media (e.g., audio, images, video, rendered text) produced by such models to identify the media as having been synthetically generated and/or to leverage the information used to generate the media. In addition, as users interface with media through a growing variety of devices, applications, and transmission formats, it may be advantageous to embed metadata into media (both human-generated and synthetically generated media) to avoid the need to convert the formatting of metadata and/or to prevent the metadata from becoming separated from the underlying media.
In some aspects, the present technology concerns systems and methods for steganographic embedding of metadata in media. In that regard, there are many contexts in which it may be beneficial to embed metadata into synthetically generated media. For example, it may be desirable to use steganography to discreetly and/or indelibly identify the fact that the media is synthetic so that people will understand its source and/or authenticity. Likewise, it may be useful to mark such media so that it may be identified and excluded from use as training data for other machine-learning models.
The present technology may also be used to leverage the information used in creating synthetically generated media to simplify the processing of that media by other devices. For example, rather than using complex automatic speech recognition (“ASR”) models to identify the words spoken in a synthetically generated audio or video sample, the original text used to generate the speech may be encoded directly into the synthetically generated video or audio stream using steganography such that a simpler decoder may be used. Likewise, rather than using optical character recognition to recognize rendered text or generated handwriting, the original text used in generating the media may be embedded into the image including the rendered text or generated handwriting using steganography. These processes may reduce the computing resources required to analyze the media, and increase the speed at which such analysis may be undertaken.
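By way of illustration only (this sketch is not part of the disclosure; the function names and the least-significant-bit scheme are merely one simple, hypothetical way to realize such an embedding), the original text used to render an image could be hidden in the image's pixel data and recovered with bit manipulation alone, with no OCR model:

```python
import numpy as np

def embed_text(image: np.ndarray, text: str) -> np.ndarray:
    """Hide a length-prefixed UTF-8 string in the least-significant bits of a uint8 image."""
    payload = text.encode("utf-8")
    data = len(payload).to_bytes(4, "big") + payload
    bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
    flat = image.flatten()  # flatten() returns a copy, so the input is untouched
    if bits.size > flat.size:
        raise ValueError("image too small for payload")
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits
    return flat.reshape(image.shape)

def extract_text(image: np.ndarray) -> str:
    """Recover the hidden string: read the 4-byte length prefix, then the body."""
    bits = image.flatten() & 1
    length = int.from_bytes(np.packbits(bits[:32]).tobytes(), "big")
    body = np.packbits(bits[32:32 + 8 * length]).tobytes()
    return body.decode("utf-8")

rendered = np.random.randint(0, 256, size=(128, 128, 3), dtype=np.uint8)
stego = embed_text(rendered, "the quick brown fox")
assert extract_text(stego) == "the quick brown fox"
```

Here the "decoder" is a few lines of bit arithmetic rather than a learned recognition model, which illustrates the reduction in computing resources described above.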
The present technology may also leverage the information used to generate synthetic media for the purpose of tuning the output of a generative model. In this way, the generative model may be trained to generate content that is more likely to be decoded accurately by other models. Thus, for example, the original text used to generate a given sample of synthesized speech may be compared to the output of a known ASR model to generate a loss value on which the generative model can be trained. Such a loss value may be used to train (e.g., parameterize) the generative model to create synthetic speech that is more likely to be correctly interpreted by that ASR model, and/or to train the generative model to include steganographic hints (e.g., one or more words, a vector configured to amplify a given classification) in the synthetic speech to bias the ASR model toward the correct interpretation of the speech. Providing such hints may reduce the data size of the synthetic speech as compared to embedding an entire transcript, for example.
Further, in some aspects of the technology, steganography may be used to embed important metadata into media to avoid the possibility that the metadata may be lost or unreadable by a given device or application. In that regard, as users interface with media through a growing variety of devices, applications, and transmission formats, media of all types (human-generated and model-generated) may need to be converted in ways that make it difficult or impossible to transmit metadata alongside the content. For example, a closed-captioning data stream may need to be formatted differently for use by a TV application than for a messaging application, such that closed-captioning data transmitted alongside the associated audio and video data will be visible in some applications but not others. However, if the closed-captioning data were instead to be embedded within the associated audio or video stream, separate closed-captioning data would not be needed and all applications could use the same decoder to identify and display that data. Likewise, steganography may be used to embed relevant information into media to reduce the computing resources required to analyze the media and increase the speed at which such analysis may be undertaken. For example, where a file type or transmission protocol does not allow for closed-captioning data to be provided in a formatted metadata field, the present technology may be used to embed that information into the media file so that it is not necessary to employ a complex ASR model to generate closed-captioning data in real-time. Further, steganography may be used to embed relevant information into media beyond what can be included in the file type's existing metadata fields. For example, a system according to the present technology may be configured to use steganography to embed the subjects and landmarks used to generate a synthetic image into the image data itself, or to embed the scores, lyrics, and instruments used to generate synthetic music into the resulting audio data.
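As a rough, hypothetical sketch of the capacity involved (the sample rate and length-prefix scheme below are assumptions, not from the disclosure), one least-significant bit per sample of 16 kHz, 16-bit mono audio yields roughly 2 KB of caption text per second of audio, far more than a typical caption track requires:

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed: 16 kHz, 16-bit mono PCM

def caption_capacity_bytes(num_samples: int) -> int:
    """One LSB per sample -> num_samples / 8 bytes, less a 4-byte length prefix."""
    return num_samples // 8 - 4

def embed_captions(audio: np.ndarray, captions: str) -> np.ndarray:
    """Hide length-prefixed UTF-8 caption text in the LSBs of int16 samples."""
    payload = captions.encode("utf-8")
    data = len(payload).to_bytes(4, "big") + payload
    bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8)).astype(np.int16)
    out = audio.copy()
    if bits.size > out.size:
        raise ValueError("audio clip too short for payload")
    out[:bits.size] = (out[:bits.size] & ~1) | bits
    return out

audio = (np.random.randn(SAMPLE_RATE) * 1000).astype(np.int16)  # one second
print(caption_capacity_bytes(audio.size))  # -> 1996 bytes per second
stego = embed_captions(audio, "[Speaker 1] Hello there.")
```

A naive bit-level embedding like this would not survive lossy re-encoding, which is one reason the learned steganography encoders described below may be preferable in practice.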
In one aspect, the disclosure describes a computer-implemented training method, comprising: (1) generating, using one or more processors of a processing system, a synthetically generated media file based at least in part on first data using a generative model; (2) encoding, using the one or more processors, second data into the synthetically generated media file using a steganography encoder to generate an encoded media file, the second data being based at least in part on the first data; (3) processing, using the one or more processors, the encoded media file using a steganography decoder to generate decoded data; (4) generating, using the one or more processors, an accuracy loss value based at least in part on the second data and the decoded data; and (5) modifying, using the one or more processors, one or both of: one or more parameters of the steganography encoder, based at least in part on the accuracy loss value; or one or more parameters of the steganography decoder, based at least in part on the accuracy loss value. In some aspects, the method further comprises: generating, using the one or more processors, a discriminative loss value based at least in part on processing the encoded media file using a discriminative model; and modifying, using the one or more processors, one or more parameters of the steganography encoder based at least in part on the discriminative loss value. In some aspects, the steganography encoder is a part of the generative model. In some aspects, the first data is a text sequence, the generative model is a text-to-speech model, and the synthetically generated media file is an audio file including synthesized speech generated by the text-to-speech model based at least in part on the text sequence. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding the text sequence into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a tokenized version of the text sequence into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a vector embedding based on the text sequence into the encoded media file.
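A minimal training-loop sketch of this first aspect, written against PyTorch with toy stand-in modules (the `StegoEncoder` and `StegoDecoder` architectures below are hypothetical placeholders, not the disclosed models), might look like:

```python
import torch
import torch.nn as nn

class StegoEncoder(nn.Module):
    """Adds a small payload-dependent residual to the media signal."""
    def __init__(self, payload_bits: int, frames: int):
        super().__init__()
        self.project = nn.Linear(payload_bits, frames)

    def forward(self, media, payload):
        return media + 0.01 * self.project(payload)  # small perturbation

class StegoDecoder(nn.Module):
    """Reads payload-bit logits back out of the encoded media."""
    def __init__(self, payload_bits: int, frames: int):
        super().__init__()
        self.read = nn.Linear(frames, payload_bits)

    def forward(self, media):
        return self.read(media)

frames, payload_bits = 1024, 64
encoder = StegoEncoder(payload_bits, frames)
decoder = StegoDecoder(payload_bits, frames)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    media = torch.randn(8, frames)  # stand-in for generative-model output (step 1)
    payload = torch.randint(0, 2, (8, payload_bits)).float()  # "second data"
    encoded = encoder(media, payload)             # step (2): encoded media file
    decoded_logits = decoder(encoded)             # step (3): decoded data
    accuracy_loss = bce(decoded_logits, payload)  # step (4): accuracy loss value
    opt.zero_grad()
    accuracy_loss.backward()
    opt.step()                                    # step (5): parameter update
```

Step (5) corresponds to the `opt.step()` call, which here updates encoder and decoder jointly; updating only one of the two is a matter of which parameters are handed to the optimizer.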
In another aspect, the disclosure describes a computer-implemented training method, comprising: generating, using one or more processors of a processing system, a synthetically generated media file based at least in part on first data using a generative model; encoding, using the one or more processors, second data into the synthetically generated media file using a steganography encoder to generate an encoded media file, the second data being based at least in part on the first data; generating, using the one or more processors, a discriminative loss value based at least in part on processing the encoded media file using a discriminative model; and modifying, using the one or more processors, one or more parameters of the steganography encoder based at least in part on the discriminative loss value. In some aspects, the first data is a text sequence, the generative model is a text-to-speech model, and the synthetically generated media file is an audio file including synthesized speech generated by the text-to-speech model based at least in part on the text sequence. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding the text sequence into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a tokenized version of the text sequence into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a vector embedding based on the text sequence into the encoded media file.
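The discriminative-loss variant can be sketched by adding a discriminator trained to tell encoded files from unmodified ones, with the steganography encoder penalized when its output is detectable. The module shapes below are illustrative assumptions that extend the previous sketch:

```python
import torch
import torch.nn as nn

FRAMES = 1024
discriminator = nn.Sequential(nn.Linear(FRAMES, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

def encoder_discriminative_loss(encoded_media: torch.Tensor) -> torch.Tensor:
    """Encoder's loss: low when the discriminator scores the encoded file as 'real'."""
    pred = discriminator(encoded_media)
    return bce(pred, torch.ones_like(pred))

def discriminator_step(real_media, encoded_media, opt_d):
    """Discriminator's own update: unmodified files -> 1, encoded files -> 0."""
    pred_real = discriminator(real_media)
    pred_fake = discriminator(encoded_media.detach())
    loss_d = bce(pred_real, torch.ones_like(pred_real)) + bce(
        pred_fake, torch.zeros_like(pred_fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    return loss_d
```

In a combined setup, the encoder's total loss could be a weighted sum of the accuracy loss from the previous sketch and this discriminative loss.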
In another aspect, the disclosure describes a computer-implemented training method, comprising: (1) encoding, using one or more processors of a processing system, first data into a media file using a steganography encoder to generate an encoded media file; (2) processing, using the one or more processors, the encoded media file using a steganography decoder to generate decoded data; (3) generating, using the one or more processors, an accuracy loss value based at least in part on the first data and the decoded data; and (4) modifying, using the one or more processors, one or both of: one or more parameters of the steganography encoder, based at least in part on the accuracy loss value; or one or more parameters of the steganography decoder, based at least in part on the accuracy loss value. In some aspects, the method further comprises: generating, using the one or more processors, a discriminative loss value based at least in part on processing the encoded media file using a discriminative model; and modifying, using the one or more processors, one or more parameters of the steganography encoder based at least in part on the discriminative loss value. In some aspects, the media file is a synthetically generated media file generated by a generative model. In some aspects, the media file was generated by the generative model based at least in part on the first data. In some aspects, the steganography encoder is a part of the generative model. In some aspects, the media file is an audio or video file containing speech. In some aspects, encoding the first data into the media file using the steganography encoder to generate the encoded media file comprises encoding a text sequence into the encoded media file, the text sequence including a transcript or a translation of at least a portion of the speech. In some aspects, encoding the first data into the media file using the steganography encoder to generate the encoded media file comprises encoding a tokenized version of a text sequence into the encoded media file, the text sequence including a transcript or a translation of at least a portion of the speech. In some aspects, encoding the first data into the media file using the steganography encoder to generate the encoded media file comprises encoding a vector embedding based on a text sequence into the encoded media file, the text sequence including a transcript or a translation of at least a portion of the speech.
In another aspect, the disclosure describes a computer-implemented training method, comprising: generating, using one or more processors of a processing system, a synthetically generated media file based at least in part on first data using a generative model; processing, using the one or more processors, the synthetically generated media file using an interpretive model to generate first interpreted data; generating, using the one or more processors, a first accuracy loss value based at least in part on the first data and the first interpreted data; and modifying, using the one or more processors, one or more parameters of the generative model based at least in part on the first accuracy loss value. In some aspects, the first data is a text sequence, the generative model is a text-to-speech model, the synthetically generated media file is an audio file including synthesized speech generated by the text-to-speech model based at least in part on the text sequence, and the interpretive model is an automatic speech recognition model. In some aspects, the method further comprises: identifying, using the one or more processors, a difference between the first data and the first interpreted data; encoding, using the one or more processors, second data into the synthetically generated media file using a steganography encoder to generate an encoded media file, the second data being based at least in part on the identified difference; processing, using the one or more processors, the encoded media file using the interpretive model to generate second interpreted data; generating, using the one or more processors, a second accuracy loss value based at least in part on the first data and the second interpreted data; and modifying, using the one or more processors, one or more parameters of the steganography encoder based at least in part on the first accuracy loss value and the second accuracy loss value. In some aspects, the method further comprises: modifying, using the one or more processors, one or more parameters of the generative model based at least in part on the first accuracy loss value and the second accuracy loss value. In some aspects, the steganography encoder is a part of the generative model. In some aspects, the first data is a first text sequence, the first interpreted data is a second text sequence, and the identified difference comprises one or more words or characters that differ between the first text sequence and the second text sequence. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding the one or more words or characters into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a tokenized version of the one or more words or characters into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a vector embedding based on the one or more words into the encoded media file.
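For the accuracy loss between the first data and the first interpreted data, one simple choice is the word error rate between the original text and the ASR transcript. A measure like this is not differentiable, so in practice a differentiable surrogate or reinforcement-style objective would be needed for direct gradient updates; the sketch below (with illustrative function names) shows only the loss computation itself:

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between reference and ASR hypothesis."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def accuracy_loss(first_data: str, first_interpreted: str) -> float:
    """Word error rate of the interpretive model against the original text."""
    ref, hyp = first_data.split(), first_interpreted.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(accuracy_loss("the cat sat", "the bat sat"))  # -> 0.333...
```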
In another aspect, the disclosure describes a computer-implemented method of outputting a media file, comprising: processing, using one or more processors of a processing system, an encoded media file using a steganography decoder to generate decoded data; outputting, using the one or more processors, media content of the encoded media file; determining, using the one or more processors, based on the decoded data, whether the media content of the encoded media file was generated by a generative model; and based on the determination, outputting, using the one or more processors, an indication of whether the encoded media file was generated by a generative model. In some aspects, outputting the indication of whether the encoded media file was generated by a generative model is performed in response to receiving an input from a user. In some aspects, the method further comprises outputting, using the one or more processors, the decoded data. In some aspects, the media file was generated by a generative model based at least in part on the decoded data. In some aspects, outputting the decoded data is performed in response to receiving an input from a user.
In another aspect, the disclosure describes a computer-implemented media generation method, comprising: generating, using one or more processors of a processing system, a synthetically generated media file based at least in part on first data using a generative model; and encoding, using the one or more processors, second data into the synthetically generated media file using a steganography encoder to generate an encoded media file, the second data being based at least in part on the first data. In some aspects, the steganography encoder is a part of the generative model. In some aspects, the first data is a text sequence, the generative model is a text-to-speech model, and the synthetically generated media file is an audio file including synthesized speech generated by the text-to-speech model based at least in part on the text sequence. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding one or more words of the text sequence into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a tokenized version of one or more words of the text sequence into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a vector embedding based on one or more words of the text sequence into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a vector into the encoded media file, the vector representing a classification generated by an interpretive model based on the text sequence.
In another aspect, the disclosure describes a processing system comprising: (A) a memory storing a generative model, a steganography encoder, and a steganography decoder; and (B) one or more processors coupled to the memory and configured to train one or both of the steganography encoder or the steganography decoder, comprising: (1) generating a synthetically generated media file based at least in part on first data using the generative model; (2) encoding second data into the synthetically generated media file using the steganography encoder to generate an encoded media file, the second data being based at least in part on the first data; (3) processing the encoded media file using the steganography decoder to generate decoded data; (4) generating an accuracy loss value based at least in part on the second data and the decoded data; and (5) modifying one or both of: one or more parameters of the steganography encoder, based at least in part on the accuracy loss value; or one or more parameters of the steganography decoder, based at least in part on the accuracy loss value. In some aspects, the memory further stores a discriminative model, and the one or more processors are further configured to: generate a discriminative loss value based at least in part on processing the encoded media file using the discriminative model; and modify one or more parameters of the steganography encoder based at least in part on the discriminative loss value. In some aspects, the steganography encoder is a part of the generative model. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a text sequence into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a tokenized version of a text sequence into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a vector embedding based on a text sequence into the encoded media file.
In another aspect, the disclosure describes a processing system comprising: (A) a memory storing a generative model, a steganography encoder, and a discriminative model; and (B) one or more processors coupled to the memory and configured to train the steganography encoder, comprising: generating a synthetically generated media file based at least in part on first data using the generative model; encoding second data into the synthetically generated media file using the steganography encoder to generate an encoded media file, the second data being based at least in part on the first data; generating a discriminative loss value based at least in part on processing the encoded media file using the discriminative model; and modifying one or more parameters of the steganography encoder based at least in part on the discriminative loss value. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a text sequence into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a tokenized version of a text sequence into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a vector embedding based on a text sequence into the encoded media file.
In another aspect, the disclosure describes a processing system comprising: (A) a memory storing a steganography encoder and a steganography decoder; and (B) one or more processors coupled to the memory and configured to train one or both of the steganography encoder or the steganography decoder, comprising: (1) encoding first data into a media file using the steganography encoder to generate an encoded media file; (2) processing the encoded media file using the steganography decoder to generate decoded data; (3) generating an accuracy loss value based at least in part on the first data and the decoded data; and (4) modifying one or both of: one or more parameters of the steganography encoder, based at least in part on the accuracy loss value; or one or more parameters of the steganography decoder, based at least in part on the accuracy loss value. In some aspects, the memory further stores a discriminative model, and the one or more processors are further configured to: generate a discriminative loss value based at least in part on processing the encoded media file using the discriminative model; and modify one or more parameters of the steganography encoder based at least in part on the discriminative loss value. In some aspects, the steganography encoder is a part of a generative model. In some aspects, encoding the first data into the media file using the steganography encoder to generate the encoded media file comprises encoding a text sequence into the encoded media file. In some aspects, encoding the first data into the media file using the steganography encoder to generate the encoded media file comprises encoding a tokenized version of a text sequence into the encoded media file. In some aspects, encoding the first data into the media file using the steganography encoder to generate the encoded media file comprises encoding a vector embedding based on a text sequence into the encoded media file.
In another aspect, the disclosure describes a processing system comprising: (A) a memory storing a generative model and an interpretive model; and (B) one or more processors coupled to the memory and configured to train the generative model, comprising: generating a synthetically generated media file based at least in part on first data using the generative model; processing the synthetically generated media file using the interpretive model to generate first interpreted data; generating a first accuracy loss value based at least in part on the first data and the first interpreted data; and modifying one or more parameters of the generative model based at least in part on the first accuracy loss value. In some aspects, the memory further stores a steganography encoder, and the one or more processors are further configured to: identify a difference between the first data and the first interpreted data; encode second data into the synthetically generated media file using the steganography encoder to generate an encoded media file, the second data being based at least in part on the identified difference; process the encoded media file using the interpretive model to generate second interpreted data; generate a second accuracy loss value based at least in part on the first data and the second interpreted data; and modify one or more parameters of the steganography encoder based at least in part on the first accuracy loss value and the second accuracy loss value. In some aspects, the one or more processors are further configured to: modify one or more parameters of the generative model based at least in part on the first accuracy loss value and the second accuracy loss value. In some aspects, the steganography encoder is a part of the generative model. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding one or more words or characters into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a tokenized version of one or more words or characters into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a vector embedding based on one or more words into the encoded media file.
In another aspect, the disclosure describes a processing system comprising: (A) a memory storing a steganography decoder; and (B) one or more processors coupled to the memory and configured to: process an encoded media file using the steganography decoder to generate decoded data; output media content of the encoded media file; determine, based on the decoded data, whether the media content of the encoded media file was generated by a generative model; and based on the determination, output an indication of whether the encoded media file was generated by a generative model. In some aspects, the one or more processors are further configured to output the indication of whether the encoded media file was generated by a generative model in response to receiving an input from a user. In some aspects, the one or more processors are further configured to output the decoded data. In some aspects, the one or more processors are further configured to output the decoded data in response to receiving an input from a user.
In another aspect, the disclosure describes a processing system comprising: (A) a memory storing a generative model and a steganography encoder; and (B) one or more processors coupled to the memory and configured to: generate a synthetically generated media file based at least in part on first data using the generative model; and encode second data into the synthetically generated media file using the steganography encoder to generate an encoded media file, the second data being based at least in part on the first data. In some aspects, the steganography encoder is a part of the generative model. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding one or more words of the first data into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a tokenized version of one or more words of the first data into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a vector embedding based on one or more words of the first data into the encoded media file. In some aspects, encoding the second data into the synthetically generated media file using the steganography encoder to generate the encoded media file comprises encoding a vector into the encoded media file, the vector representing a classification generated by an interpretive model based on the first data.
The present technology will now be described with respect to the following exemplary systems and methods.
Processing system 102 may be resident on a single computing device. For example, processing system 102 may be a server, personal computer, or mobile device, and a given model may thus be local to that single computing device. Similarly, processing system 102 may be resident on a cloud computing system or other distributed system. In such a case, a given model may be distributed across two or more different physical computing devices. For example, in some aspects of the technology, the processing system may comprise a first computing device storing layers 1-n of a given model having m layers, and a second computing device storing layers n-m of the given model.
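For example, such a two-device split might be sketched in PyTorch as follows (the layer sizes and device names are assumptions, and running this requires two GPUs):

```python
import torch
import torch.nn as nn

# A 12-layer model (m = 12) split across two devices: the first six layers on
# one device, the remaining six on another.
layers = [nn.Linear(256, 256) for _ in range(12)]
first_half = nn.Sequential(*layers[:6]).to("cuda:0")
second_half = nn.Sequential(*layers[6:]).to("cuda:1")

def forward(x: torch.Tensor) -> torch.Tensor:
    hidden = first_half(x.to("cuda:0"))      # runs on the first device
    return second_half(hidden.to("cuda:1"))  # activations hop to the second
```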
The processing systems described herein may be implemented on any type of computing device(s), such as any type of general computing device, server, or set thereof, and may further include other components typically present in general purpose computing devices or servers. Likewise, the memory of such processing systems may be of any non-transitory type capable of storing information accessible by the processor(s) of the processing systems. For instance, the memory may include a non-transitory medium such as a hard drive, memory card, optical disk, solid-state memory, tape memory, or the like. Computing devices suitable for the roles described herein may include different combinations of the foregoing, whereby different portions of the instructions and data are stored on different types of media.
In all cases, the computing devices described herein may further include any other components normally used in connection with a computing device such as a user interface subsystem. The user interface subsystem may include one or more user inputs (e.g., a mouse, keyboard, touch screen and/or microphone) and one or more electronic displays (e.g., a monitor having a screen or any other electrical device that is operable to display information). Output devices besides an electronic display, such as speakers, lights, and vibrating, pulsing, or haptic elements, may also be included in the computing devices described herein.
The one or more processors included in each computing device may be any conventional processors, such as commercially available central processing units (“CPUs”), graphics processing units (“GPUs”), tensor processing units (“TPUs”), etc. Alternatively, the one or more processors may be a dedicated device such as an ASIC or other hardware-based processor. Each processor may have multiple cores that are able to operate in parallel. The processor(s), memory, and other elements of a single computing device may be stored within a single physical housing, or may be distributed between two or more housings. Similarly, the memory of a computing device may include a hard drive or other storage media located in a housing different from that of the processor(s), such as in an external database or networked storage device. Accordingly, references to a processor or computing device will be understood to include references to a collection of processors or computing devices or memories that may or may not operate in parallel, as well as one or more servers of a load-balanced server farm or cloud-based system.
The computing devices described herein may store instructions capable of being executed directly (such as machine code) or indirectly (such as scripts) by the processor(s). The computing devices may also store data, which may be retrieved, stored, or modified by one or more processors in accordance with the instructions. Instructions may be stored as computing device code on a computing device-readable medium. In that regard, the terms “instructions” and “programs” may be used interchangeably herein. Instructions may also be stored in object code format for direct processing by the processor(s), or in any other computing device language including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. By way of example, the programming language may be C#, C++, JAVA or another computer programming language. Similarly, any components of the instructions or programs may be implemented in a computer scripting language, such as JavaScript, PHP, ASP, or any other computer scripting language. Furthermore, any one of these components may be implemented using a combination of computer programming languages and computer scripting languages.
[Figure-by-figure description truncated in this copy; the surviving fragments reference, among other elements, the steganography decoder 602 of FIG. 6 and the interpretive model 702 of FIG. 7.]
In some aspects of the technology, the discriminative loss value 804 may be generated directly by the discriminative model based on the encoded media file (310, 406, or 508). Likewise, in some aspects of the technology, the discriminative model 802 may process the encoded media file (310, 406, or 508) to generate an output (e.g., a score or probability that the encoded media file is real, a classification of “real” or “fake,” etc.) based on which the processing system (e.g., processing system 102) then generates a discriminative loss value 804. In either case, the discriminative loss value 804 may be generated based on any suitable paradigm. For example, in some aspects of the technology, the discriminative loss value 804 may be based on how likely the encoded media file (310, 406, or 508) is to be real. Likewise, in some aspects of the technology, the discriminative loss value 804 may be one value if the encoded media file (310, 406, or 508) is predicted to be real (e.g., 1), and another value if the encoded media file (310, 406, or 508) is predicted to be fake (e.g., 0).
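Both paradigms can be sketched directly (illustrative only, with `p_real` assumed to be the discriminator's output probability that the encoded file is real):

```python
import torch

def discriminative_loss_from_probability(p_real: torch.Tensor) -> torch.Tensor:
    """First paradigm: loss shrinks as the encoded file looks more 'real'."""
    return -torch.log(p_real.clamp_min(1e-8))

def discriminative_loss_binary(predicted_real: bool) -> float:
    """Second paradigm: one value (e.g., 1) for a 'real' prediction,
    another (e.g., 0) for a 'fake' prediction, as in the example above."""
    return 1.0 if predicted_real else 0.0
```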
In step 902, a processing system (e.g., processing system 102) generates a synthetically generated media file based at least in part on first data using a generative model. This may be done in any suitable way. Thus, the generative model may be any suitable generative model, including any of the options described above with respect to generative model 304 of FIG. 3.
In step 904, the processing system encodes second data into the synthetically generated media file using steganography to generate an encoded media file, the second data being based at least in part on the first data. This encoding may be applied by the generative model itself (e.g., as described above with respect to model 404 of FIG. 4) or by a separate steganography encoder.
In step 1004, the processing system (e.g., processing system 102) processes the encoded media file (generated in step 904 of FIG. 9) using a steganography decoder to generate decoded data.
In step 1006, the processing system generates an accuracy loss value based at least in part on the second data and the decoded data. The processing system may use the second data and the decoded data to generate this accuracy loss value in any suitable way, including any of the options described above with respect to the generation of accuracy loss value 606 of FIG. 6.
In step 1008, the processing system modifies one or more parameters of the steganography encoder and/or the steganography decoder based at least in part on the accuracy loss value. In that regard, modifying one or more parameters of the steganography encoder may involve: (a) where the generative model is configured to apply the encoding to the synthetically generated media file (e.g., as described above with respect to model 404 of FIG. 4), modifying one or more parameters of the generative model; or (b) where a separate steganography encoder applies the encoding, modifying one or more parameters of that separate steganography encoder.
The processing system may be configured to modify the one or more parameters based on the accuracy loss value in any suitable way, and at any suitable interval. Thus, in some aspects of the technology, the processing system may be configured to conduct a back-propagation step in which it modifies the one or more parameters of the steganography encoder and/or steganography decoder every time an accuracy loss value is generated. Likewise, in some aspects, the processing system may be configured to wait until a predetermined number of accuracy loss values have been generated, combine those values into an aggregate accuracy loss value (e.g., by summing or averaging the multiple accuracy loss values), and modify the one or more parameters of the steganography encoder and/or steganography decoder based on that aggregate accuracy loss value.
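A sketch of the second option, aggregating a predetermined number of loss values into an average before each parameter update (the `AggregatedUpdater` helper is a hypothetical name, not from the disclosure):

```python
import torch

class AggregatedUpdater:
    """Waits for a predetermined number of loss values, combines them into an
    aggregate (here, their average), then applies a single parameter update."""
    def __init__(self, opt: torch.optim.Optimizer, every: int = 16):
        self.opt, self.every, self.count = opt, every, 0

    def step(self, loss: torch.Tensor) -> None:
        (loss / self.every).backward()  # gradients accumulate to the average
        self.count += 1
        if self.count == self.every:
            self.opt.step()
            self.opt.zero_grad()
            self.count = 0
```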
In step 1104, the processing system (e.g., processing system 102) generates a discriminative loss value based at least in part on processing the encoded media file using a discriminative model. This may be done in any suitable way. Thus, the discriminative model may be any suitable type, including any of the options described above with respect to discriminative model 802 of FIG. 8.
In step 1106, the processing system modifies one or more parameters of the steganography encoder based at least in part on the discriminative loss value. Here as well, modifying one or more parameters of the steganography encoder may involve: (a) where the generative model is configured to apply the encoding to the synthetically generated media file (e.g., as described above with respect to model 404 of FIG. 4), modifying one or more parameters of the generative model; or (b) where a separate steganography encoder applies the encoding, modifying one or more parameters of that separate steganography encoder.
The processing system may be configured to modify the one or more parameters based on the discriminative loss value in any suitable way, and at any suitable interval. Thus, in some aspects of the technology, the processing system may be configured to conduct a back-propagation step in which it modifies the one or more parameters of the steganography encoder every time a discriminative loss value is generated. Likewise, in some aspects, the processing system may be configured to wait until a predetermined number of discriminative loss values have been generated, combine those values into an aggregate discriminative loss value (e.g., by summing or averaging the multiple discriminative loss values), and modify the one or more parameters of the steganography encoder based on that aggregate discriminative loss value.
In step 1202, a processing system (e.g., processing system 102) encodes first data into a media file using a steganography encoder to generate an encoded media file. This media file may be of any suitable type (e.g., audio, image, video, etc.), and may have been generated in any suitable way (e.g., synthetically generated, human-generated, etc.), including any of the options discussed above with respect to media file 502 of FIG. 5.
In step 1204, the processing system processes the encoded media file using a steganography decoder to generate decoded data. This step may be performed in any suitable way, as described above with respect to step 1004 of FIG. 10.
In step 1206, the processing system generates an accuracy loss value based at least in part on the first data and the decoded data. As with step 1006 of FIG. 10, this may be done in any suitable way.
In step 1208, the processing system modifies one or more parameters of the steganography encoder and/or the steganography decoder based at least in part on the accuracy loss value. As with step 1008 of FIG. 10, this may be done in any suitable way.
In step 1304, the processing system (e.g., processing system 102) generates a discriminative loss value based at least in part on processing the encoded media file using a discriminative model. This step may be performed in any suitable way, as described above with respect to step 1104 of FIG. 11.
In step 1306, the processing system modifies one or more parameters of the steganography encoder based at least in part on the discriminative loss value. As with step 1106 of FIG. 11, this may be done in any suitable way.
In step 1402, a processing system (e.g., processing system 102) generates a synthetically generated media file based at least in part on first data using a generative model. As with step 902 of FIG. 9, this may be done in any suitable way.
In step 1404, the processing system processes the synthetically generated media file using an interpretive model to generate first interpreted data. This may be done in any suitable way. Thus, the interpretive model may be any suitable type of model configured to interpret the content of a particular type of media file (e.g., audio, image, video, rendered text, etc.) in order to generate first interpreted data, including any of the options described above with respect to interpretive model 702 of FIG. 7.
In step 1406, the processing system generates a first accuracy loss value based at least in part on the first data and the first interpreted data. The processing system may use the first data and the first interpreted data to generate this first accuracy loss value in any suitable way, including any of the options described above with respect to the generation of accuracy loss value 706 of FIG. 7.
In step 1408, the processing system modifies one or more parameters of the generative model based at least in part on the first accuracy loss value. As discussed above with respect to FIG. 7, this may be done in any suitable way, and at any suitable interval.
In step 1504, the processing system (e.g., processing system 102) identifies a difference between the first data and the first interpreted data. The processing system may identify a difference between the first data and the first interpreted data in any suitable way. Thus, in some aspects of the technology, the processing system may be configured to compare the content of the first data to the content of the first interpreted data to identify a difference between them. For example, where the first data and the first interpreted data are in text format, the processing system may compare the text of the first data to the text of the first interpreted data to identify one or more words or characters that differ. Likewise, in some aspects of the technology, the processing system may be configured to identify a difference between the first data and the first interpreted data indirectly, such as by comparing a vector based on the first data to a vector based on the first interpreted data. For example, where the interpretive model is configured to output a vector based on the synthetically generated media file (e.g., a vector representing the interpretive model's classification of an object depicted in a synthetically generated image), the processing system may be configured to likewise generate a vector based on the first data (e.g., using a learned embedding function) so that it may be compared to the output of the interpretive model to identify any differences between how the first data and the first interpreted data would be classified.
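For text-format data, the word-level comparison described here can be sketched with Python's standard difflib (the function name is illustrative only):

```python
import difflib

def identify_differences(first_data: str, first_interpreted: str) -> list[str]:
    """Return the words of the original text that the interpretive model missed."""
    ref, hyp = first_data.split(), first_interpreted.split()
    wrong = []
    for op, i1, i2, _j1, _j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op in ("replace", "delete"):  # words changed or dropped by the model
            wrong.extend(ref[i1:i2])
    return wrong

print(identify_differences("the cat sat down", "the bat sat"))  # ['cat', 'down']
```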
In step 1506, the processing system encodes second data into the synthetically generated media file using a steganography encoder to generate an encoded media file, the second data being based at least in part on the difference identified in step 1504. Here as well, this encoding may be applied by a steganography encoder that is part of the generative model (e.g., as described above with respect to model 404 of FIG. 4) or by a separate steganography encoder.
In the example of
In addition, in some aspects of the technology, the second data may simply be related to the identified difference. For example, where the interpretive model is configured to output a vector based on the synthetically generated media file (e.g., a vector representing the interpretive model's classification of an object depicted in a synthetically generated image), the second data may be a vector representing a prediction of the correct classification (e.g., the classification produced by applying a learned embedding function to the first data).
In step 1508, the processing system processes the encoded media file using the interpretive model (used previously in step 1404 of FIG. 14) to generate second interpreted data.
In step 1510, the processing system generates a second accuracy loss value based at least in part on the first data and the second interpreted data. Here as well, the processing system may use the first data and the second interpreted data to generate this second accuracy loss value in any suitable way, including any of the options described above with respect to the generation of accuracy loss value 706 of FIG. 7.
In step 1512, the processing system modifies one or more parameters of the generative model and/or the steganography encoder based at least in part on the first accuracy loss value and the second accuracy loss value. Here again, as discussed above, this may be done in any suitable way.
In addition, the processing system may be configured to modify the one or more parameters of the generative model and/or the steganography encoder based on the first accuracy loss value and the second accuracy loss value at any suitable interval. Thus, in some aspects of the technology, the processing system may be configured to conduct a back-propagation step in which it modifies the one or more parameters of the generative model and/or the steganography encoder every time a pair of first and second accuracy loss values are generated. Likewise, in some aspects, the processing system may be configured to wait until a predetermined number of accuracy loss value pairs have been generated, use those accuracy loss value pairs to generate one or more aggregate accuracy loss values (e.g., by summing or averaging all of the first accuracy loss values to generate a first aggregated accuracy loss value, and summing or averaging all of the second accuracy loss values to generate a second aggregated accuracy loss value, etc.), and modify the one or more parameters of the generative model and/or the steganography encoder based on those aggregate accuracy loss values.
In step 1602, a processing system (e.g., processing system 102) processes an encoded media file using a steganography decoder to generate decoded data. As discussed above with respect to step 1004 of FIG. 10, this may be done in any suitable way.
In step 1604, the processing system outputs the media content of the encoded media file. This may be done using any suitable utility and/or hardware for displaying or playing the type of media within the encoded media file. For example, if the content of the encoded media file includes an image, the processing system may output the image by providing an instruction for the image to be displayed on a monitor, printer, or other type of display device. Likewise, if the content of the encoded media file includes a video, the processing system may output the video by providing an instruction for visual data of the video to be displayed on a monitor or other type of display device and/or for audio data of the video to be played on a speaker or other audio output device. Similarly, if the content of the encoded media file includes audio data, the processing system may output the audio data by providing an instruction for the audio data to be played on a speaker or other audio output device, and/or by instructing that a visualization of the audio data's content (e.g., a graph of the audio data's waveform) be displayed on a monitor, printer, or other type of display device.
In step 1606, the processing system determines whether the media content of the encoded media file was generated by a generative model based on the decoded data. The processing system may use the decoded data to make this determination in any suitable way. Thus, in some aspects of the technology, the processing system may be configured to determine that the media content of the encoded media file was generated by a generative model based solely on the fact that the steganography decoder was able to extract decoded data from the encoded media file. Likewise, in some aspects of the technology, the processing system may be configured to determine whether the media content of the encoded media file was generated by a generative model based on the content of the decoded data. For example, in some aspects of the technology, the decoded data may include an indication that the media content of the encoded media file was generated by a generative model.
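One simple, hypothetical realization of such a content-based determination reserves a recognizable marker at the head of the encoded payload; the marker value and function name below are assumptions, not from the disclosure:

```python
SYNTHETIC_MARKER = b"GENAI\x01"  # hypothetical header reserved by the encoder

def was_model_generated(decoded_data: bytes | None) -> bool:
    """Decoding succeeded AND the payload carries the reserved marker."""
    return decoded_data is not None and decoded_data.startswith(SYNTHETIC_MARKER)
```

A decoder that recovers no payload at all from unencoded media would then report `False` for ordinary, human-generated files.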
In step 1608, based on the determination of step 1606, the processing system outputs an indication of whether the encoded media file was generated by a generative model. Here as well, the processing system may output this indication using any suitable utility and/or hardware. For example, the processing system may output a message indicating that the encoded media file was or was not generated by a generative model by providing an instruction for the message to be displayed on a monitor, printer, or other type of display device, or by providing an instruction for a synthesized reading of the message to be played on a speaker or other audio output device. Likewise, in some aspects of the technology, the processing system may output any other suitable type of indication of its determination, such as by providing an instruction for an icon or image to be displayed on a display device, for all or a portion of a screen to blink, for a sound to be played on a speaker or other audio output device, etc. As will be understood, in some aspects of the technology, the processing system may be configured to only output this indication based on certain preconditions (e.g., in response to a request from a user).
In step 1610, the processing system outputs the decoded data. Step 1610 is an optional step within exemplary method 1600, and may be performed at any suitable time relative to the other steps. For example, the processing system may output the decoded data before, after, or at the same time as it outputs the media content of the media file (step 1604). Likewise, the processing system may output the decoded data before, after, or at the same time as it outputs the indication of whether the encoded media file was generated by a generative model (step 1608). Here as well, the processing system may output the decoded data using any suitable utility and/or hardware. For example, if the decoded data is a sequence of text (e.g., closed-captioning content of a video file, text used to synthetically generate speech in a video or audio file, etc.), the processing system may output the sequence of text by providing an instruction for the text to be displayed on a monitor or other type of display device. In addition, in some aspects of the technology, the processing system may be configured to only output the decoded data based on certain preconditions (e.g., in response to a request from a user).
Unless otherwise stated, the foregoing alternative examples are not mutually exclusive, but may be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the foregoing description of exemplary systems and methods should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as “such as,” “including,” “comprising,” and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only some of the many possible embodiments. Further, the same reference numbers in different drawings can identify the same or similar elements.
| Filing Document | Filing Date | Country | Kind |
| --- | --- | --- | --- |
| PCT/US2021/054550 | 10/12/2021 | WO | |