The general field of application of the present invention involves techniques for embedding watermarks into digital media files; the invention being more particularly directed to the embedding of a reversible watermark into such digital media files, and then removing such a watermark, in whole or in part, at some later date, without access to the original media file. Among such media type files are audio, image, video, 3-D and the like; such watermarks being primarily intended for, though not limited to, the introduction of perceptually significant elements into the media, such as pseudorandom noise, tonal elements, or vocal elements, such that the media are degraded but suitable merely for demonstration or trial purposes. The watermark is resistant to removal without proper authorization and these perceptually significant elements can then be removed from the media to prepare it for its ultimate high-quality use.
The use of peer-to-peer file sharing systems such as Napster, Grokster, Kazaa, and BitTorrent has grown greatly in recent years, primarily due to the wide community of users interested in sharing digital media with one another. Recent developments in peer-to-peer sharing, such as podcasting, where users create pre-mixed downloadable streams of music, as well as the providing of the ability for people to easily share files in public settings through, for example, 802.11 and Bluetooth, have demonstrated that users have an increasing desire to share digital media with one another.
Unfortunately for the content creation industry, much of that digital media is being shared without the payment of appropriate fees. Because of this, the content creation industry has strongly encouraged the development of Digital Rights Management (DRM) techniques to technologically regulate how files are shared. In the licensing mechanism known as “superdistribution,” first described by Ryoichi Mori, people are allowed to share media with each other, but must receive a separate license in order to be able to freely enjoy the content:
Mori, Ryoichi, and Masaji Kawahara, “Superdistribution: The Concept and the Architecture,” The Transactions of the IEICE; Vol. E 73, No. 7 July 1990, Special Issue on Cryptography and Information Security.
There are difficulties, however, with the two previous main approaches to such superdistribution: watermarked formats, and encrypted envelopes.
If the music is distributed unsecured in an open format, such as mp3, which contains an embedded DRM watermark, the supplier is dependent on every link in the chain being perfectly secure in order to enforce the DRM. Since the music is typically stored unencrypted, a hacker can still bypass the DRM and get a full-quality version of the song from the media player storage.
On the other hand, if the media are distributed in a proprietary, secure envelope format, very few media players will be able to play it without upgrading, etc. This “all or nothing” approach slows the adoption of the many proprietary secure media formats that have been developed over the years.
The approach of the present invention, accordingly, is designed to counter the weaknesses of these two prior approaches, by providing a secure container, implemented using open standards, which is nevertheless partially playable in an unsecured or unmodified environment.
Prior watermarking techniques typically embed human-perceptible or machine-readable information into a media stream, so that this embedded information is robust to the degradation and manipulation of the media. In the normal use scenario, a media producer will add a watermark to the media file in order to be able to track the following distribution of that file, and to discourage unauthorized use. Typical watermarking techniques rely on gross characteristics of the signal being preserved through common types of transformations applied to a media file.
Unlike the system of the present invention, however, they are explicitly designed not to be reversible, and, indeed, greatly degrade the quality of the file if they should be removed.
A survey of techniques for multimedia data labeling, and particularly for copyright labeling using watermarks, is presented by Langelaar, G. C. et al. in “Copy Protection For Multimedia Data based on Labeling Techniques” (http://www-it.et.tudelft.nl/html/research/smash/public/benlx96/benelux_cr.html).
The earlier cited Langelaar et al publication, in turn, references and discusses the following additional prior art publications:
An additional article by Langelaar also discloses earlier labeling of MPEG compressed video formats:
These Zhao and Koch, Boland et al and Langelaar et al disclosures, while teaching encoding technique approaches having partial similitude to components of the techniques employed by the present invention, as will now be more fully explained, are not, however, either anticipatory of, or actually adapted for providing for the removal of such data at a later date, without drastically impairing the quality of the media and the usability thereof.
Considering, first, the approach of Zhao and Koch, above-referenced, they embed a signal in an image by using JPEG-based techniques ([JPEG] Digital Compression and Coding of Continuous-tone Still Images, Part 1: Requirements and guidelines, ISO/IEC DIS 10918-1). They first encode a signal in the ordering of the size of three coefficients, chosen from the middle frequency range of the coefficients in an 8-block or octet DCT. They divide eight permutations of the ordering relationship among these three coefficients into three groups: one encoding a ‘1’ bit (HML, MHL, and HHL), one encoding a ‘0’ bit (MLH, LMH, and LLH), and a third group encoding “no data” (HLM, LHM, and MMM). They have also extended this technique to the watermarking of video data. While their technique is robust and resilient to modifications, they do not, however, provide for the removal of such data. As will later more fully be explained, this is a disadvantage overcome by the present invention.
As for Boland, Ruanaidh, and Dautzenberg, they use a technique of generating the DCT, Walsh Transform, or Wavelet Transform of an image, and then adding one to a selected coefficient to encode a “1” bit, or subtracting one from a selected coefficient to encode a “0” bit. This technique, although at first blush superficially similar to one aspect of one component of the present invention, has the very significant limitation, obviated by the present invention, that information can only be extracted by comparing the encoded image with the original image. This means that a watermarked and a non-watermarked copy of any media file must be sent simultaneously in order for the watermarking to work. This is a rather severe limitation, completely overcome by the current invention. In addition to its being impossible to verify the existence of a watermark without a copy of the original media, it is also impossible to remove the watermark using their technique.
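The dependence on the original can be illustrated with a toy sketch. This is not the cited authors' implementation; the coefficient values, the selected positions, and the helper names `embed_bits` and `extract_bits` are hypothetical and chosen only to show why extraction needs the unmarked data.

```python
def embed_bits(coeffs, positions, bits):
    """Add 1 to a selected coefficient for a '1' bit, subtract 1 for a '0' bit."""
    marked = list(coeffs)
    for pos, bit in zip(positions, bits):
        marked[pos] += 1 if bit else -1
    return marked


def extract_bits(marked, original, positions):
    """Recover the bits; note that the ORIGINAL coefficients are required."""
    return [1 if marked[pos] > original[pos] else 0 for pos in positions]


original = [52, -13, 7, 0, 4, -2, 1, 0]  # hypothetical transform coefficients
positions = [1, 2, 4]                    # hypothetical selected coefficients
marked = embed_bits(original, positions, [1, 0, 1])
recovered = extract_bits(marked, original, positions)
```

Nothing in `marked` alone indicates which coefficients were changed or in which direction, which is the limitation described above.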
Various forms of imperceptible watermarking were also developed and tested as part of the Secure Digital Music Initiative, but were subsequently abandoned during pre-release testing, after songs were quickly hacked to remove the watermark, though this was at the cost of further degrading the quality of the music, again unlike in the present invention.
There are many implementations of secure envelopes to provide DRM techniques. Typically, they create a container file which contains an encrypted media stream, and which can be unlocked with an appropriate license key. Unlike the current invention, however, they cannot be played in any form when the user either has an incompatible player, or is not licensed to play that content. Often, the players allow for limited previewing of the content without a license, but such previewing nevertheless requires a proprietary player capable of reading the container file and extracting the media. This once more is contrasted from the present invention, where the content is stored in a standard media format, capable of being read and played at lower quality without proprietary means.
The invention herein might be described as a middle path between the two classes of prior techniques—watermarking and secure envelopes, novelly combining the ubiquity of open formats with the power of an encrypted envelope. This novel “try before you buy” approach does not even require the user to have a special player to try the music.
A typical use scenario might be as follows. Bob meets Alice at a coffee shop, and is impressed with the Balinese music collection on her mp3 player, so he downloads all these songs onto his cell phone. When Bob plays them later, the first 30 seconds of each song play well enough for him to hear the quality of the recording, but after that, the embedded watermarked noise reduces the quality to below that of an AM radio broadcast. However, if he likes a song and purchases a license to it, the entire song is restored, and plays at its original high quality.
The present invention creates a standard media file that has an audible, reversible watermark added. Upon appropriate licensing, this watermark can either be temporarily removed during the decoding and playback process, or can be permanently removed from the media file. Generally, for security reasons, in a situation which allows for further sharing of the media file, the watermark will be temporarily removed only on the in-memory version, as part of the playback process.
A system described in European patent application EP 1 465 157 A1 also implements a similar system to that described here, which is capable of inserting an apparent watermark and later removing it. Unlike the system of the present invention, however, which uses reversible mathematical operations to insert and remove the watermark, it relies on the copying of saved data from watermarked sections to unused portions of the audio file; for example, to ancillary portions. A drawback of that system is that the file must necessarily increase substantially in size to accommodate this saved data: that specific approach adds about 100 bytes per frame, or 32 kilobits/second, to the data file to store it.
The present invention, on the other hand, does not have the limitation of requiring that this “undo” information be saved, since the watermark is added by using reversible operations. In the system of the invention, only a few bytes per frame (compared to 100 bytes per frame in said European patent application system) are necessary to be stored to recreate the noise envelope so that it can be removed.
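The size comparison can be checked with a short calculation. The sketch below assumes 44.1 kHz audio and 1,152 samples per mp3 frame (assumptions not stated in the text above), and takes "a few bytes" to mean the 4 to 12 bytes per frame given for the watermarking parameters later in this description.

```python
SAMPLE_RATE_HZ = 44_100    # assumption: CD-quality source audio
SAMPLES_PER_FRAME = 1_152  # assumption: one MPEG-1 Layer III frame


def overhead_kbps(bytes_per_frame):
    """Extra bit rate, in kilobits/second, for a per-frame byte overhead."""
    frames_per_second = SAMPLE_RATE_HZ / SAMPLES_PER_FRAME
    return bytes_per_frame * 8 * frames_per_second / 1000


saved_data_kbps = overhead_kbps(100)  # prior system: about 100 bytes per frame
params_low_kbps = overhead_kbps(4)    # present system, lower bound
params_high_kbps = overhead_kbps(12)  # present system, upper bound
```

Under these assumptions the 100-byte approach costs roughly 30 kilobits/second, in line with the figure of about 32 quoted above, while 4 to 12 bytes per frame cost under 4 kilobits/second.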
Because the number of bytes needed is so much smaller, the present invention can take advantage of techniques such as those described in applicant's earlier U.S. Pat. Nos. 6,748,362 (dealing with embedding data in media files) and 6,768,980 (dealing with steganographic embedding of data in digital measurements) to embed those few bytes of data, without needing to increase the file size at all. Additionally, while the system of the present invention is capable of embedding data in a multitude of media formats, prior systems are limited to spectrally-encoded audio signals, such as mp3, only.
In World Intellectual Property Organization application WO 99/55089, still a different type of system is described which “scrambles” the bits of a music file by interchanging portions of an audio sample with other nearby portions in the file. This technique again differs from the present invention, which does not rely on interchanging data at all. The following publications, however, do describe a system with similar intent to the present invention, with a technique they describe as “Bitwise XOR of least significant bits of the quantized spectral coefficients with a key-dependent pseudo random number sequence.”
Herre, Jürgen and Eric Allamanche, “Compatible Scrambling of Compressed Audio,” Proceedings 1999 IEEE Workshop on Applications of signal Processing to Audio and Acoustics, New Paltz, N.Y., Oct. 17-20, 1999.
Unlike the technique of the present invention, though, this compatible scrambling technique is not described as using the step of analyzing the content of the media so as to add the proper amount of noise to the media, and so does not overcome the limitation of being unable to vary the amount of noise introduced dynamically. Instead, the authors describe the use of an alternate method, “reordering of spectral coefficients,” which they say “produced the most uniform distortion for all types of audio material.”
They confirm this in a subsequent paper, where they describe the same technique as “Bitwise XOR with Spectral Coefficients”:
In the system described therein, on pp. 8-9 of that document, they state that “subjective informal testing of the perceptual degradations showed that the distortions produced for incorrect descrambling depend on the type of audio material encoded and may, for some cases, not be strong enough to discourage illegal listening.” They were not able to overcome this problem, so their final solution proposes a different system involving the swapping of various coefficients. In accordance with the present invention, a significant, novel and non-obvious improvement is provided which addresses this problem; namely, analyzing the media properties, creating a customized watermark designed to be perceptually intrusive in the context of that media, and then storing the watermarking parameters, as later more fully detailed.
Still another patent application EP 1 189 372 A2 teaches a system wherein a noise signal, which is at “a level of sound perceivable by the human sense of hearing signal,” is added to an existing signal. In their system, an audio signal is separated out into a number of frequency bands. The “telephone voice band” of 300-3,400 Hz is separated out to have noise signal parameters stored in it using imperceptible watermarking techniques. The remaining information has a noise signal added to it based on these noise signal parameters.
Although this may appear superficially similar to the system of the present invention, there are significant differences. They do not teach, as do applicants, how to limit the added noise signal to avoid having media compressed with a frequency-domain codec (such as mp3) affected by the addition of the noise signal during the encoding process. They must also embed the noise signal parameters in untouched frequency bands, whereas the present invention is able to embed the noise signal parameters in the same frequency bands where the noise signal itself is embedded. They also only teach the embedding of a single “third key”, which appears to be their term for the noise signal parameters, in the song (paragraphs 78-79, 116, and 122), and so do not teach how to have noise signal parameters that change from moment to moment throughout the song.
The techniques taught as part of the present invention, furthermore, as later detailed may be used with any type of digital media file, including music formats such as but not limited to CD, mp3, AAC, ATRAC, WMA, GSM, and CDMA; image formats such as but not limited to JPEG, TIFF, and GIF; video formats such as but not limited to MPEG, MPEG-2, H.264, and VC-1; and 3D formats such as but not limited to VRML, Web3D, and volumetric data.
One area of great use for this invention is to enable content providers to release music and video over the Internet such that it can be freely shared among members of the target audience, while still providing for these content providers to be remunerated for their works. In such an environment, a particular embodiment of this invention targeting the distribution of mp3 format files may provide for the content provider to insert an apparent, audible and disruptive watermark after the first 30 seconds of song playback, such that it is still possible for the user to hear the music and determine whether he or she is interested in purchasing the music. If so, the user attains a license, which contains a cryptographic key, through some external licensing mechanism. Upon receipt of a valid license for the content, this invention then decrypts and removes the watermark during playback of the music using techniques taught later in this document.
It is necessary for this apparent, audible, and disruptive watermark, which is herein termed a “noisemark,” to be sensitive to the dynamics of the song; for example, the amount of added noise which disrupts a symphony would not even be noticed in a heavy metal song. Additionally, for modern compression algorithms using frequency-domain compression techniques and adaptive compression, such as mp3, the frequency range and characteristics of such a noisemark should be recalculated every frame, for technical reasons, since it is difficult to maintain a high-quality output with a truly reversible watermark unless the noise introduced always has the same frequency profile as the music itself, as will be explained later in this application.
In one embodiment, this invention can use data embedding techniques such as those described in applicants’ before-mentioned earlier U.S. Pat. Nos. 6,748,362 and 6,768,980, which create a second data channel in a digital media stream. This second data channel can contain not only the reversible watermark described in this invention, but also embedded rich media, such as transactions, ads, interactive music videos, and the like. Instead of requiring a paid license, the media player can enforce viewing rich media content as a condition of licensing, to remove the noisemark and listen at full quality.
The current invention also interoperates with and is fully compatible with robust watermark DRM solutions, since the reversible watermark can ride on top of many types of robust watermark. Additionally, since what is created is a standard digital media stream, data envelope DRM mechanisms such as Apple Computer's “Fairplay” can transparently encapsulate it. This allows the creation of rich new licensing mechanisms which combine the strengths of all types of DRM approaches.
Another anticipated use of this invention is to provide perceptible yet removable watermarks for media tracking purposes. For example, it is useful for a photographer to be able to submit watermarked photographs to a newspaper for review, but for that photographer also to be able to license removal of the watermarks once the newspaper has decided to purchase them for publication. This is also useful for firms selling stock media, so that they can authorize restoration of the media to a high-quality version, and do not have to ship out substitute, un-watermarked or higher quality media to be used for final output.
It is accordingly a primary object of the present invention to provide a new and improved method of and apparatus for reversibly adding watermarks to media data, which shall not be subject to the above-described and other limitations and disadvantages of prior art approaches, through the novel use of watermarks that are added through reversible mathematical operations (such as addition and exclusive or (XOR)), wherein enough parameters are encoded in the watermarked media file to allow for the watermark to be regenerated, and then removed through reversing the aforementioned mathematical operation.
Other and further objects will be explained hereinafter and are more particularly delineated in the appended claims.
In summary, however, from one of its broader or generic aspects, the invention embraces the method of and apparatus for adding a reversible watermark to media data, that comprises: analyzing the media data to determine which watermarking elements are most suitable for adding, based on the intended use of the media and the codec with which it is being compressed; creating a watermark based on these parameters; adding the watermark to the media file using a reversible mathematical operation; encoding all necessary parameters for later use, either into the media, through steganographic means or additional data channels, or through storing in an external database; upon the user receiving the media, playing it with the watermark until a proper license is received; and, if so received, recreating the watermark and then reversing the mathematical operation thereby to remove the watermark.
Best mode and preferred embodiments, techniques and designs for implementing the invention are hereinafter explained in detail.
The invention will now be described in connection with the accompanying drawings, which illustrate the following:
An important application of this invention is to add a perceptible, and in fact intrusive, watermark to media, which is freely distributed in order to encourage as many people as possible to sample it, and then decide to upgrade by removing the watermark, thus restoring the media to a high-quality version.
Although these watermarking techniques are sufficiently general and powerful that they can be applied to any form of digital media, in one embodiment they are applied to frequency-domain-transformed data generated as part of the compression process, for example by the DCT transformation used in many modern CODECs such as mp3 audio, MPEG video, JPEG pictures, etc. Such types of transformed data have unique qualities that make the present invention particularly useful with them, as is described later.
In one embodiment, the watermarking data can consist of random bit-values, in which case it will add some form of noise to the media. It can also consist of somewhat structured data. For example, in an audio application, a branded sequence of tones, embedded among other values, can signify the presence of a removable watermark. The addition of a voice prompt suggesting that the user purchase an upgraded version of the media is also possible, as is text or logos added to an image or video file.
A source music file 300, such as a PCM-encoded CD-audio file, is processed by the mp3 compression algorithm as described in:
As described in that specification, audio is broken down into 576-sample frames; each frame is then transformed into the frequency domain using the Discrete Cosine Transform (DCT), and the resulting values are scaled using a scaling factor, resulting in some frequency values 320. This invention, in this embodiment, analyzes the resulting frequency values to generate a simpler parameterized representation of which frequency ranges have the most power, which we term a “frequency envelope” 330.
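One possible form of such an analysis is sketched below. It is an illustrative assumption rather than the specific representation of the invention: the frame is split into a hypothetical number of equal bands, and each band is summarized by the bit length of its largest quantized magnitude. The function name `frequency_envelope` and the sample values are likewise hypothetical.

```python
def frequency_envelope(freq_values, num_bands=8):
    """Summarize a frame as one small integer per band.

    Each entry is the bit length of the band's peak magnitude, so bands with
    more power get larger entries and empty bands get zero.
    """
    band_size = len(freq_values) // num_bands
    envelope = []
    for b in range(num_bands):
        start = b * band_size
        # The last band absorbs any leftover values.
        end = len(freq_values) if b == num_bands - 1 else start + band_size
        peak = max(abs(v) for v in freq_values[start:end])
        envelope.append(peak.bit_length())
    return envelope


# Hypothetical quantized frequency values for a very short frame.
frame = [120, -87, 45, 30, -12, 9, 5, -3, 2, 1, 0, 0, 0, 0, 0, 0]
env = frequency_envelope(frame, num_bands=4)
```

For this sample frame the envelope falls from the low-frequency band to zero in the empty high-frequency band, which is the information needed to keep added noise out of ranges the music does not occupy.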
It should be noted that, if a watermark that contains white noise is added to the frequency values of an mp3, this will automatically increase the compressed size, since the noise pollutes the higher-frequency values, making the music much harder to compress. In general, the mp3 codec adapts to this noise by decreasing the overall music quality to bring it back down to the same compressed size. In order to avoid this decrease in quality, it is necessary to compute a frequency envelope which restricts the added noise to having the same frequency characteristics as the music.
The frequency ranges that hold only zero values in a given frame, and which white noise would therefore pollute, vary from song to song and, in fact, from beat to beat, so a static approach will not work. The only way to properly embed reversible noise into an mp3, without degrading it by making the music more difficult to compress, is to dynamically compute a frequency envelope for each frame, which has the exact same frequency range as that frame, and which thus will tend to keep the compressed representation of that song about the same size.
In one embodiment, this frequency envelope comprises a concise description a few bytes long that describes the sizes of various sub-groups of frequencies within this block. Such a description is described in greater detail later in this document. A random number is used as a parameter for the seed to generate small watermarking values for each frequency of the frame. Where the intent is to mark the audio with random low-bit noise, this is a straightforward random generator. Where the intent is to mark the audio with a series of tones or other understandable content, the watermarking content is shaped by the randomly generated noise, so that the watermark is thereby difficult to remove. This watermarking content is then shaped by the parameters of frequency envelope so that the watermark impacts those areas of the music where there is already the most energy 340.
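A minimal sketch of seed-driven noise shaped by an envelope follows. It assumes, for illustration only, an envelope holding one bit-length entry per equal-sized band and a hypothetical `strength_bits` cap on the noise amplitude; Python's `random.Random` stands in for whatever generator an implementation would actually use and is not cryptographically secure.

```python
import random


def make_noise(seed, envelope, frame_len, strength_bits=2):
    """Regenerate the same noise values whenever seed and envelope match."""
    rng = random.Random(seed)
    band_size = frame_len // len(envelope)
    noise = []
    for i in range(frame_len):
        band = min(i // band_size, len(envelope) - 1)
        # Never use more noise bits than the band's own magnitude occupies,
        # so bands with no energy receive no noise at all.
        bits = min(strength_bits, envelope[band])
        noise.append(rng.getrandbits(bits) if bits else 0)
    return noise


envelope = [7, 4, 2, 0]  # hypothetical envelope for a 16-value frame
noise = make_noise(seed=42, envelope=envelope, frame_len=16)
```

Because the generator is deterministic, storing only the seed and the envelope is enough to recreate the noise later, which is what keeps the per-frame parameter size small.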
The watermarking noise is then combined 350 with the frequency values of that frame of music, in one embodiment by using the bit-wise XOR operation, which is easily reversible by re-applying it. In another embodiment, the watermarking noise is added to the frame of music, though any reversible mathematical operation can be used in this invention. Since it is possible for the combined new values to exceed the range of representation of values in this format, if this occurs, either the description of the frequency envelope can be amended to remove such values from the watermarking noise, or a new random seed can be chosen and the watermark re-applied until the combined new values no longer exceed the range of allowable representation.
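The two combining operations and the range check can be sketched as follows. The value limits and function names are hypothetical; the sketch only shows that XOR undoes itself when re-applied, that addition is undone by subtraction, and that an out-of-range result is detected so a caller could amend the envelope or pick a new seed.

```python
MAX_VALUE = 8191   # hypothetical largest representable quantized value
MIN_VALUE = -8191  # hypothetical smallest representable quantized value


def xor_combine(values, noise):
    """Bit-wise XOR; applying it a second time restores the input."""
    return [v ^ n for v, n in zip(values, noise)]


def add_combine(values, noise):
    """Additive variant; raises if a combined value leaves the allowed range."""
    combined = [v + n for v, n in zip(values, noise)]
    if any(c < MIN_VALUE or c > MAX_VALUE for c in combined):
        raise OverflowError("combined value out of range; retry with new seed")
    return combined


def add_remove(combined, noise):
    """Reverse the additive variant."""
    return [c - n for c, n in zip(combined, noise)]


values = [120, -87, 45, 0, -3, 8191]  # hypothetical frequency values
noise = [3, 1, 2, 0, 3, 1]            # hypothetical watermarking noise
marked = xor_combine(values, noise)
restored = xor_combine(marked, noise)
```

In this sketch the XOR variant cannot leave the range, since it alters only the low bits of each value, whereas the additive variant fails on the last sample and would trigger the retry described above.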
Finally, the parameters necessary to regenerate the watermarking noise created in the previous step are encrypted and stored in the mp3 360. These parameters may include: the random seed-value, the watermark envelope, an identifier for any understandable content added, and any areas excluded from watermarking, and are generally 4-12 bytes per frame. Any available encryption technique known to those skilled in the art can be used to encrypt these parameters, including but not limited to DES, IDEA, Blowfish, RSA, PGP, etc. In one embodiment, these parameters are stored using a second data channel, as described in our earlier cited U.S. Pat. Nos. 6,748,362 and 6,768,980. Alternatively, the data can be stored by placing it before a SYNC value, as described in the ID3v2 specification:
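As an illustration of how such parameters could fit in a few bytes, the sketch below packs a 32-bit seed and eight 4-bit envelope entries into 8 bytes. The layout and the names `pack_params` and `unpack_params` are assumptions made for this example, and the encryption step is deliberately omitted; a real implementation would encrypt the packed bytes before storing them.

```python
import struct


def pack_params(seed, envelope):
    """Pack a 32-bit seed followed by one 4-bit entry per envelope band."""
    if any(not 0 <= e <= 15 for e in envelope):
        raise ValueError("envelope entries must fit in 4 bits")
    padded = envelope + [0] if len(envelope) % 2 else list(envelope)
    nibbles = bytes(
        (padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2)
    )
    return struct.pack(">I", seed) + nibbles


def unpack_params(blob, num_bands):
    """Inverse of pack_params."""
    (seed,) = struct.unpack(">I", blob[:4])
    envelope = []
    for byte in blob[4:]:
        envelope.extend([byte >> 4, byte & 0x0F])
    return seed, envelope[:num_bands]


blob = pack_params(0xDEADBEEF, [7, 4, 2, 0, 0, 0, 0, 0])
seed, envelope = unpack_params(blob, num_bands=8)
```

The resulting 8 bytes sit within the 4-12 byte range mentioned above; an identifier for added content or a list of excluded areas would need additional fields.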
ID3v2 spec: http://www.id3.org/easy.html and http://www.id3.org/id3v2.3.0.html
The value may also be stored in the ancillary data field described in the mp3 specification, or can be stored at the end of the music file, after all frames of music. Any other mechanism in the format that provides for an additional channel of data may be used, for example when such audio is encapsulated within another media stream such as Quicktime or MPEG-4.
Finally, at the completion of the modified mp3 encoding process, the result is an mp3 with an embedded reversible watermark 370.
These same and similar techniques can be used by one skilled in the art to apply this technique to any audio compression format, including but not limited to the well known AAC, ATRAC, ADPCM, GSM, CDMA, etc. formats. We briefly outline the main modifications to the approach necessary for each class of formats:
Frequency-domain formats, such as AAC, ATRAC, etc., all use the same basic series of steps (with perhaps other transforms than the DCT transform used in mp3 used to transform into the frequency domain) outlined in
In signal-domain formats, such as ADPCM and PCM, etc., the noise is added in the signal domain of the audio signal. In such a case, the DCT transform is not part of the encoding process, but a frequency transform may still be used as part of the process to determine the amount of noise to add to the audio to ensure that it is perceptually significant.
In a vocoder format, such as GSM or CDMA, the analysis process, instead of using a frequency transform, uses the characteristics of the vocoder parameters to determine how to add perceptually significant noise to the signal, and then such watermarking noise is added to selected vocoder parameters.
A source video file 400, such as a DV-encoded video file, is processed by the MPEG video compression algorithm as described in the previously referenced MPEG Spec-ISO/IEC 11172, part 1-3.
In the MPEG encoding process 410, a sequence of video frames is compressed into multiple types of frames: I, B, and P frames. Intra-coded, or I-coded frames are stand-alone frames, coded without respect to other frames. Predictive-coded, or P-coded frames provide for motion-compensated prediction from an I or another P frame. Bidirectionally-coded, or B-coded frames sit between two I or P frames, and provide forward and backward prediction relative to an I or P frame. Regardless of the type of frame, blocks within the frame are encoded using similar techniques. A frame of video is broken down into macroblocks, each of which contains six 8 by 8 blocks of pixels. Four of these represent Y, or luminance values for a region, and the other two represent Cb and Cr chrominance, respectively. Each of these blocks is processed in the same way, by transforming this 8 by 8 block of values using the DCT transform, scaling it by a scaling factor, and then scanning it in zig-zag order to generate a one-dimensional array of coefficients, resulting in some frequency values 420.
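The zig-zag scan mentioned above can be sketched as follows. This shows only the conventional scan order for a square block; the DCT and scaling steps are omitted, and the function names are chosen for this example.

```python
def zigzag_order(n=8):
    """Return (row, col) pairs in zig-zag scan order for an n-by-n block."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals are walked from bottom-left to top-right,
        # odd diagonals from top-right to bottom-left.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def zigzag_scan(block):
    """Flatten a square block into a one-dimensional list of coefficients."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


# Hypothetical block whose entries equal their row-major index.
block = [[r * 8 + c for c in range(8)] for r in range(8)]
coefficients = zigzag_scan(block)
```

The scan places low-frequency coefficients first, so a frequency envelope computed over the one-dimensional array groups coefficients of similar frequency together.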
As in
A random number is used as a parameter for the seed to generate small watermarking values for each frequency of the video frame. Where the intent is to mark the video with random low-bit noise, this is a straightforward random generator. Where the intent is to mark the video with a series of recognizable shapes such as letters, the watermarking content is shaped by the randomly generated noise, so that the watermark is thereby difficult to remove. This watermarking content is then shaped by the parameters of frequency envelope so that the watermark impacts those areas of the video where there is already the most energy 440.
The watermarking noise is then combined 450 with the frequency values of that frame of video, using techniques previously described with step 350. And finally, these parameters are encrypted and stored in the video 460, as described in the previous step 360.
Finally, at the completion of the modified MPEG video encoding process, the result is an MPEG video with an embedded reversible watermark 470.
These same and similar techniques can be used by one skilled in the art to apply this technique to any video compression format, including but not limited to the well known MPEG-2, MPEG-4, H.264, VC-1, etc. formats. Because such approaches typically use very similar techniques to that used in the MPEG format, they are straightforward for someone skilled in the art to apply.
A source image 500, such as a raw image file, is processed by the JPEG image compression algorithm 510. The JPEG algorithm contains several types of encoding schemes, but in the most commonly used one, as in the MPEG algorithm, an image is transformed into a different color space, in this case YUV. Each 8×8 block of Y values forms a block, and depending on the downsampling technique used, 8×8 blocks of U and V values are created by scaling from either 8×16 or 16×16 blocks of U and V values. Each of these blocks is processed in the same way, by transforming this 8 by 8 block of values using the DCT transform, scaling it by a scaling factor, and then scanning it in zig-zag order to generate a one-dimensional array of coefficients, resulting in some frequency values 520.
Frequency value analysis and watermarking proceed in steps 530 and 540, as in the previously described steps 430 and 440, respectively.
The watermarking noise is then combined 550 with the frequency values of that image, as in the previous step 450. And finally, these parameters are encrypted and stored in the image 560, as described in step 460. The result is a JPEG image with an embedded reversible watermark 570.
These same and similar techniques can be used by one skilled in the art to apply this technique to any image compression format. Because such approaches typically use very similar techniques to that used in the JPEG format, they are straightforward for someone skilled in the art to apply.
Media containing an embedded reversible watermark 600 is played on a Media Player 610, consisting of, for example, a computer device, a portable media player, or a portable gaming device. When this media is played on the Media Player and the user does not already have a license for that content, the user is presented with the option to upgrade that content by requesting a license to play that content at full quality. Such a license is requested by using a Network 620 to send the Request for License 630, which contains but is not limited to such information as the identifier of the media, the type of license requested, payment information if needed, and the device or devices for which the license is requested. This is sent to one or more License Servers 640, which are able to query the License Database 650 for the necessary information. The Returned License 660 contains all necessary information to remove the watermark. For cases where the watermarking parameters are stored in encrypted form inside the media, this may consist solely of a decryption key to decrypt said parameters. In other cases where it is not suitable to store these watermarking parameters inside the media, the License Database can contain these watermarking parameters and return them as part of the Returned License. In either case, the result is that the media player uses the process described in
Further modifications will also occur to those skilled in this art, and such are considered to fall within the spirit and scope of the present invention as defined in the appended claims.