The present invention is related to the field of video steganography, watermarking, digital video copyright protection methods and devices and, more particularly, to methods and security devices for electronic document authentication and video copyright protection.
In the following disclosure, non-patent publications are cited by a number, e.g. [1], which refers to the section “Cited non-Patent Publications” at the end of the description.
Electronic documents are used today in many forms, such as e-bills, e-tickets, and e-identity cards. Many people hold such digital documents on their computers and smartphones instead of printing them. For example, at airports, people prefer to have their boarding passes scanned from their smartphones. Even though these documents are stored digitally, most of them have intermediate security features that are designed for the printed versions. The possibility of counterfeiting these documents digitally creates an important problem due to the availability of digital tools and image manipulation software. In the near future, many documents will be stored and processed electronically on smartphones and tablet screens. In this context, the present invention discloses a new secure document encoding and authentication method for documents that are presented on screens to the authorities.
Video copyright protection is a very important problem in the movie industry. Movie revenues decline because of piracy. Movie pirates can copy an original movie from different sources.
One method is to directly copy the movie (e.g. DCP) that is projected in movie theatres. The second method is the ripping of DVDs or Blu-rays. The third method is the direct copying of the movie from on-demand platforms such as Netflix, Amazon Prime, and Hulu. After copying, these movies are distributed illegally on streaming platforms. To counter all three methods, it is important to detect the identity of the pirate and the source of piracy. The present invention can be used as a video seal within the title or credits section of a movie. The video seal secretly carries the identity of each person and organisation that distributes the medium. Once the movie is found on other streaming platforms, the origin of the piracy can be detected easily through this video seal by using a conventional camera.
Steganography is a technique used for secret communication. The hidden message and visible content can be unrelated. The main concern of steganography is the undetectability of the hidden content. Many steganography methods act on the spatial domain of the image [1,2]. In many methods, the hidden content is embedded by changing the least significant bits of pixel values. Embedding at the spatial level is sufficient to deceive the human visual system. However, the resistance of these methods to attacks is weak. More advanced steganography methods use the spatial frequency domain, where embedding is performed at the cosine transform level [3,4]. McKeon [5] shows how to use the discrete Fourier transform for steganography in movies. Some adaptive steganography methods consider the statistics of image features before embedding the message. For example, the noisy parts of an image are more suitable for embedding than the smooth parts [6].
Digital watermarking is used for the protection of digital content. Unlike in steganography, the visible content is more important than the hidden content. The strength of watermarking methods is related to the difficulty of removing hidden content from the visible content. The watermark aims at marking the digital content with an ownership tag. Copyright protection, fingerprinting to trace the source of illegal copies, and broadcast monitoring are the main purposes of digital watermarking [7]. In reversible watermarking techniques, a complete restoration of the visible content is possible with the extraction of the watermark [8]. Several approaches use lossless image compression techniques to create space for the watermarking data [9,10]. Although many different algorithms are used, the main goal of all reversible watermarking methods is the same: avoiding damage to sensitive information in the visible content and enabling a full extraction of the watermark and original data. For the extraction of the watermark, a retrieval function is required. Complex embedding functions result in complex retrieval functions requiring special software. This is one of the disadvantages of digital watermarking techniques. Although they provide a high level of security, the originality cannot be checked rapidly.
Many patents exist in the video watermarking and steganography domains. U.S. Pat. No. 6,557,103 to Boncelet et al. presents a data hiding steganographic system which uses digital images as a cover signal, embedding information into the least significant bits in a way that prevents humans from recognizing it visually. In some inventions, multiple bit auxiliary data is embedded into the video that can only be decoded with an intermediate function (U.S. Pat. No. 20,070,223,592 to Rhoads). U.S. Pat. No. 6,559,883 to Fancher et al. presents a system specifically for preventing movie piracy in movie theaters. The system is formed by an encoding system generating an infrared pattern and a display showing it. A human observer viewing this display cannot recognize infrared patterns but once the display is recorded by a camera, the infrared patterns become visible. U.S. Pat. No. 20,090,031,429 to Zeev creates a predetermined pattern in the unreadable part of the storage medium which is configured to be only perceived by a media reader having a special setup. This allows only authenticated people to read the media files. Another invention (U.S. Pat. No. 6,529,600 to Epstein and Stanton) presents a method and device against movie piracy that frequently varies the frame rate, line rate, or pixel rate of the projector.
Our method is not directly competing with conventional watermarking and steganography. We generate synthetic video seals hiding visual information that can be revealed with a standard camera. We present a complicated encoding method but a very simple decoding method. Most steganographic methods use very complex decoding procedures. In contrast, our method aims at revealing information without using any decoding algorithm, i.e. by long exposure photography. Therefore, the present invention differs strongly from existing visual watermarking or steganography methods.
By exploiting the limitations of the human visual system with respect to the temporal domain, we design an algorithm for creating special synthetic video seals, that we call tempocodes. Such a synthesized video either appears as spatial noise or carries a visible message different from the hidden one. If the correct exposure time is set, the hidden image is revealed by a camera.
The present invention discloses a method of hiding an image into a synthetic video that is generated from that image by applying an expansion function. This function expands the image intensity values of pixels in the time domain by varying them from the original intensity values but still ensuring that the integration of the variations over time yields the original intensity values. The hidden image appears neither spatially on the frames of the synthetic video (e.g. by pausing the video to check a current frame), nor by the eye integrating successive frames temporally (e.g. by a human watching the video on a video player).
The encoding technique is complex. It includes a multi-frequency decomposition operation with three possible temporal expansion functions. A first encoding technique consists in generating the synthetic video with a random function in the multi-frequency domain, resulting in spatially and temporally varying noise. The second encoding technique creates the synthetic video in the form of a sinusoidal wave in the multi-frequency domain that appears as spatial noise evolving smoothly in time. The third encoding technique enables generating synthetic videos combining multi-frequency domain decomposition, random expansion function and dithering function, yielding smoothly varying tiny structures having the form of symbols, graphic elements, shapes, text, or images.
The decoding technique is very simple and differs from watermarking methods. The presented expansion function ensures that the integration of the synthetic video, i.e. the average over the successive frames, yields the original hidden image. This enables revealing hidden images by using conventional cameras having an adjustable exposure time feature. Once the exposure time is set according to the duration of the video, taking a photo of the video that is running on the display reveals the hidden image.
One advantage of the present invention is that the hidden image cannot be revealed by the human eye even if the video is observed at high or low frame rates. The human visual system is able to average the successive frames within a time interval of about 40 ms. This enables the perception of smooth motion in videos. When our synthesized videos are displayed on displays having a high frame rate, there is a danger of revealing the hidden image because of the temporal integration capability of the human eye. However, because of the decomposition of the image to be hidden into frequency bands and the expansion with variable amplitude signals, the hidden content is not revealed, even when watching the video on a very high frame rate display.
A further aspect of the present invention is a method to generate synthetic videos hiding the image in multi-colour. To generate multi-colour videos, the expansion function is applied to each colour channel separately in the multi-frequency domain.
Synthetic videos that are generated by the present invention can be used as a security feature in electronic documents such as electronic tickets and identification cards. Another usage of the present invention is against movie piracy. These synthetic videos hiding the identity of the movie customer can be embedded in the credits or title sections of movies or videos. In case of illegal distribution of such a movie, the video seal will facilitate the identification of the pirate distributing the movie illegally.
One aspect of the invention is directed to a method for generating, in a computing system, a synthetic electronic video comprising a plurality of sequential video frames containing a hidden image that is not ascertainable by the naked eye of a human observer when the video is played on an electronic display, the method comprising the steps of:
The invention method may further include a method of recovering the hidden image comprising: (d) averaging said plurality of sequential video frames and recovering thereby the hidden image.
Step d) may be performed by a camera that captures the video played on an electronic display and combines the plurality of sequential video frames into a still image that reveals the hidden image.
The electronic display may be a device selected from a set of TV, computer display, tablet, smartphone, and smart watch.
In an advantageous embodiment, the expansion function may be selected from the set of
In an advantageous embodiment, the camera is selected from a set of
In an advantageous embodiment, the method includes, before or during step (a), reducing the contrast of the hidden image, and after step (d) increasing the contrast of the recovered hidden image.
In an advantageous embodiment, the expansion function may be applied to each color channel separately to generate said synthetic video in color.
The method according to an aspect of the invention may further include embedding the synthetic electronic video within a classical video or movie.
A further aspect of the invention is directed to a computing system operable for generating a synthetic electronic video comprising a plurality of sequential video frames containing a hidden image that is not ascertainable by the naked eye of a human observer when the video is played on an electronic display, said computing system comprising software modules operable for:
The computing system may further comprise a camera operable for capturing and averaging said synthetic video frames, thereby recovering the hidden image.
A further aspect of the invention is directed to a synthetic electronic video comprising a plurality of video frames containing a hidden image that is not ascertainable by the naked eye of a human observer when the video is played on an electronic display, and wherein the hidden image is revealed by averaging the plurality of video frames of said video.
The synthetic electronic video may advantageously be embedded within a classical video or movie.
In an advantageous embodiment, the hidden image does not appear in any single video frame.
In an advantageous embodiment, the synthetic electronic video comprises a dynamically evolving message different from the hidden image, where said dynamically evolving message comprises a visual element selected from the set of text, logo, graphic element, and picture.
For a better understanding of the present invention, one may refer by way of example to the accompanying drawings, in which:
The goal of the present work is to hide an image in a video stream under the constraint that the temporal average of the video reveals the image. Specifically, the input image should remain invisible in each frame of the video and should not become visible due to the temporal integration of consecutive frames by the human visual system (HVS). In order to achieve this, a visual masking method that acts both in the spatial and in the temporal domain is required. Spatial masking inhibits orientation and frequency channels of the HVS. In temporal masking, any information coming from the target image by temporal averaging should be masked.
Our method hides an input image within a video. The image is revealed by averaging, which is either achieved by pixelwise mathematical averaging of the video frames or by long exposure photography. We call the video hiding the input image “tempocode” or equivalently “tempocode video”.
Regarding the vocabulary, we also call the image to be hidden within the tempocode video “target hidden image” or simply “target image”. Sometimes we refer to one pixel called “target pixel” of the target image or of an instance of the target image that has been obtained by processing it, for example by decomposition into frequency bands. A target pixel has a “target intensity value” or simply a “target intensity”. In analogy with the science of signal processing, the term “target signal” or simply “target” is used for the signal to be hidden. In the present disclosure, there is an implicit analogy between the term “target signal” and “target image” or between “target signal” and “target image pixel”.
In order to create such tempocodes, we apply the following self-masking approach. We first decrease the dynamic range of the input image and decompose it into a certain number of frequency bands. For each frequency band of the contrast reduced input image, we generate temporal samples by sampling a selected expansion function, whose integration along a certain time interval gives the corresponding frequency band. We then reconstruct each video frame from the temporal samples derived from the frequency bands. We consider the following expansion functions: random function, sinusoidal composite wave function, and a temporally-varying dither function. Using these functions we generate different masking effects such as smoothly evolving videos and videos with visible moving patterns.
We now describe our approach for hiding an image in a video. The hidden information is not perceivable by the human eye but the pixelwise average of the video over a time interval ranging between 2 seconds and 20 seconds reveals the hidden image. With the correct exposure time, conventional and digital cameras can detect the hidden information. Software averaging over the video frames also reveals the image.
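The software averaging mentioned above can be sketched as follows. This is a minimal illustration, assuming grayscale frames stored as nested lists with intensities normalised to [0, 1]; the function name `average_frames` and the toy frame data are hypothetical and not part of the disclosure.

```python
def average_frames(frames):
    """Pixelwise mean of equally sized grayscale frames (each a list of rows)."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(frame[y][x] for frame in frames) / n for x in range(width)]
            for y in range(height)]

# Toy "video": 4 frames of 1 x 2 pixels (illustrative values only).
frames = [[[0.9, 0.1]], [[0.1, 0.5]], [[0.6, 0.2]], [[0.4, 0.4]]]
hidden = average_frames(frames)  # temporal average of each pixel
```

A long-exposure photograph of the display performs the same integration optically, provided the exposure time matches the duration of the video.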
The main challenge resides in masking the input image by spatio-temporal signals that are a function of the input image. To achieve this, we present a visual masking process that enables hiding the input image for both the spatial and the temporal perception of human beings.
In conventional visual masking methods, the mask and the target signal to be hidden are different stimuli. However, in our method, the mask is constructed from the target image. We call this approach “self-masking”.
We initially define the problem in the continuous domain. A constant target signal p is reproduced by the integration of ƒ(t), a time dependent expansion function, over a duration τ:
In order to create spatial noise, a phase shift parameter δ is selected randomly at each spatial position. We assume that the display is linear. The target signal p, the duration τ, and the phase shift δ are known parameters. The challenge resides in finding a function ƒ(t+δ), satisfying this integration and ensuring that the target signal is masked at each time and within each small time interval (~40 ms). We present the different alternatives for the expansion function ƒ(t+δ) in the “Expansion Functions” section.
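For reference, the integration constraint referred to as Eq. (1) can be written out as below. This is a reading of the surrounding prose rather than a reproduction of the original equation; in particular, the normalisation by 1/τ (so that the temporal average, not the raw integral, equals p) is an assumption.

```latex
p \;=\; \frac{1}{\tau}\int_{0}^{\tau} f(t+\delta)\,\mathrm{d}t
```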
In practice, our signals are not continuous since the target image to be hidden is a digital image and the mask is a digital video designed for modern displays. Let I be a target image to be masked (i.e. hidden) into a video V having n frames. Initially, we reduce the contrast of the input image I by linear scaling and obtain the contrast reduced image Ic. This is required in order to reach the masking threshold, i.e. the threshold where the target image is hidden.
A multi-band masking approach is required to mask both high frequency and low frequency target image contents. Applying the expansion function solely on input pixels would only mask the high frequency content. Therefore, we decompose the contrast reduced target image Ic into spatial frequency bands. A Gaussian pyramid is computed from the contrast reduced target image Ic. To obtain the frequency bands, we compute the differences of every two neighbouring pyramid levels. In practice, we use a standard Laplacian pyramid with a 1-octave spacing between frequency bands, see reference [11], herein incorporated by reference. Finally, for each contrast reduced pixel value Icl(x,y) in each band l, we solve a discretized instance of Eq. (1). Let t1, . . . tn be a set of n uniformly spaced time points (
where vil(x,y) is the frame Vi of frequency band l at time point ti of the resulting video and where (x,y) indicates the pixel location. A different phase shift value δl is assigned to each pixel (x,y) in each band l.
Once all bands vil(x,y) of each frame vi(x,y) are constructed, we sum the corresponding bands to obtain the final frame at time point ti:
where k is the number of bands and (x,y) is the position of a given pixel within the frame.
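The principle of splitting a signal into bands and summing the bands back can be sketched as follows. This is a simplified illustration, not the Laplacian pyramid of reference [11]: it works on a single row of pixels, keeps every band at full resolution, and uses a box blur whose radius doubles per level as a stand-in for the Gaussian filter. The names `blur`, `decompose`, and `reconstruct` are hypothetical.

```python
def blur(signal, radius):
    """Box blur with edge clamping (stand-in for a Gaussian low-pass filter)."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

def decompose(signal, k):
    """Split a signal into k bands whose pointwise sum is the signal."""
    bands, current = [], list(signal)
    for level in range(k - 1):
        smoothed = blur(current, 2 ** level)  # radius doubles per level (assumption)
        bands.append([c - s for c, s in zip(current, smoothed)])  # detail band
        current = smoothed
    bands.append(current)  # low-pass residual
    return bands

def reconstruct(bands):
    """Sum the corresponding samples of all bands."""
    return [sum(values) for values in zip(*bands)]

row = [0.5, 0.52, 0.48, 0.55, 0.45, 0.5, 0.53, 0.47]
bands = decompose(row, k=3)
restored = reconstruct(bands)
```

Because each band is the difference between two successive low-pass versions, the sum telescopes back to the input; the same property is what allows every band to be expanded in time separately and the frames to be rebuilt by summation.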
For decoding purposes, the average of the tempocode frames 219 gives the contrast reduced input image Ic from which the input I 220 is recovered. In the present example, the resulting video has n=24 frames and is constructed with k=7 frequency bands. In
A masking signal with a certain contrast can mask a target signal having a contrast smaller than the masking threshold. In the present invention, we always generate our mask with 100 percent contrast in order to enable a maximal contrast of the target image to be hidden. To ensure that the target image is hidden, we first reduce the contrast of the target image I and move the contrast reduced image to the center of the available intensity range. The resulting contrast reduced image Ic is:
where α is the reduction factor and 0<α<1.
The amount of contrast reduction α depends on the contrast, spatial frequency, and orientation of the image to be hidden.
It is very important to select the correct contrast reduction factor α to reach the masking threshold. However, the input image consists of a mixture of locally varying contrasts, spatial frequencies, and orientations that affect masking. The contrast reduction factor α should be selected by considering the local image element that requires the largest amount of the contrast reduction. Once this image element is masked, all other image elements are masked as well.
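The contrast reduction and its inversion after decoding can be sketched as follows. The formula used here, linear scaling by α followed by a shift to mid-gray with intensities normalised to [0, 1], is an assumption consistent with the description above; the function names are hypothetical.

```python
def reduce_contrast(value, alpha):
    """Scale an intensity in [0, 1] by alpha and centre it around 0.5 (assumed form)."""
    return alpha * value + (1.0 - alpha) / 2.0

def restore_contrast(value, alpha):
    """Invert reduce_contrast, e.g. after averaging the frames."""
    return (value - (1.0 - alpha) / 2.0) / alpha

alpha = 0.2  # illustrative reduction factor, 0 < alpha < 1
darkest = reduce_contrast(0.0, alpha)    # lower end of the reduced range
brightest = reduce_contrast(1.0, alpha)  # upper end of the reduced range
roundtrip = restore_contrast(reduce_contrast(0.73, alpha), alpha)
```

With α = 0.2 the reduced image occupies only the band from 0.4 to 0.6, leaving the rest of the intensity range available for the mask.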
Many different types of temporal expansion functions ƒ(t+δ) fulfill the requirements of Eq. (1). We can define a random function with uniform probability, a Gaussian function, a Bezier curve, a logarithmic function, or periodic functions such as a square wave, a triangle wave, or a sine wave. However, the following constraints need to be satisfied:
In the following, we describe random, periodic, and dither expansion functions.
Our random expansion function is made of n random uniformly distributed samples varying temporally for each pixel of each band (
If the contrast of the target image is sufficiently reduced, the random function masks to a large extent the target image. However, this is only true when each frame is observed separately. When all frames are played as a video (e.g., at 30 frames per second), the target image might be slightly revealed. This is due to the fact that the target image is well masked spatially but not temporally. The human visual system has a temporal integration interval of 40±10 ms. Therefore a few consecutive frames can be averaged by the human visual system.
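A random expansion of one pixel value can be sketched as follows. This is a minimal illustration under stated assumptions: intensities lie in [0, 1], and the mean constraint is enforced by spreading the residual error over the samples that can still move without leaving the range. The function name `random_expansion` and this particular correction scheme are hypothetical.

```python
import random

def random_expansion(target, n, rng, lo=0.0, hi=1.0):
    """Draw n uniform samples in [lo, hi], then adjust them so their mean is target."""
    samples = [rng.uniform(lo, hi) for _ in range(n)]
    for _ in range(n + 1):
        error = target * n - sum(samples)
        if abs(error) < 1e-12:
            break
        # Only samples that can still move towards the error without leaving the range.
        free = [i for i, s in enumerate(samples)
                if (s < hi if error > 0 else s > lo)]
        if not free:
            break
        share = error / len(free)
        for i in free:
            samples[i] = min(hi, max(lo, samples[i] + share))
    return samples

rng = random.Random(7)           # fixed seed for reproducibility
frames = random_expansion(0.45, 24, rng)  # 24 temporal samples for one pixel
```

Each sample is statistically unrelated to its temporal neighbours, which is why such a signal masks well within a single frame but can leak under the temporal integration of the eye.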
As we have seen in the previous section, a temporally continuous low frequency masking signal is required to avoid revealing the target signal by temporal integration of the human visual system. We thus propose a periodic function that results in spatial discontinuity and temporal continuity of the resulting video.
We use a sine function as our periodic function. Spatial juxtaposition of phase-shifted sine functions may reveal local parts of the target image. Therefore, instead of using a regular sine function, we create a sinusoidal composite wave by varying the function in amplitude for a given number of temporal segments.
In order to create m sine segments varying in amplitude, we first generate m uniformly distributed random temporal parent-samples pjl(x,y) for each pixel of each band ensuring that their mean is Icl(x,y):
Since we have a small number of parent-samples (e.g. 4 samples), the mean Icl(x,y) will not be exactly achieved. Therefore, we redistribute the error across the samples. Next, for each parent-sample pj, we establish a function ƒj(t+δ) in the form of Eq. 1 such that:
where
is the start time,
is the end time, j∈[1, . . . , m] is the index of each parent-sample, and τ is the total duration of the video to be averaged.
We define the expansion function ƒj(t+δ) for each parent sample as a continuous section of a sine in a form that is analytically integrable and lies within the allowed intensity range for most of its values.
where kj is the amplitude and T is the period. As shown in
By inserting Eq. 8 into Eq. 7, we can express kj as a function of the other parameters:
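As an illustration of this step, the sketch below assumes that each segment has the form offset + kj·sin(2π(t+δ)/T) with an offset of 0.5 (mid-gray). That form is an assumption made here for the example, since the original equations are not reproduced in this text. Under it, requiring the segment mean to equal the parent-sample pj gives a closed-form amplitude, which the code checks numerically. All function and parameter names are hypothetical.

```python
import math

def segment_amplitude(p_j, t_start, t_end, period, delta, offset=0.5):
    """Amplitude k_j such that the segment mean equals p_j (assumed sine form)."""
    omega = 2.0 * math.pi / period
    denom = (math.cos(omega * (t_start + delta))
             - math.cos(omega * (t_end + delta)))
    if abs(denom) < 1e-12:
        raise ValueError("sine integrates to zero over this segment")
    return (p_j - offset) * omega * (t_end - t_start) / denom

def segment_mean(k_j, t_start, t_end, period, delta, offset=0.5, steps=10000):
    """Numerical mean of the segment (midpoint rule), used as a check."""
    omega = 2.0 * math.pi / period
    dt = (t_end - t_start) / steps
    total = 0.0
    for i in range(steps):
        t = t_start + (i + 0.5) * dt
        total += offset + k_j * math.sin(omega * (t + delta))
    return total / steps

k_j = segment_amplitude(0.6, 0.0, 1.0, period=3.0, delta=0.25)
mean = segment_mean(k_j, 0.0, 1.0, period=3.0, delta=0.25)
```

Under this assumed form the amplitude is undefined when the sine integrates to zero over the segment, for example when the segment spans a whole number of periods, so the period and segment length would have to be chosen to avoid that case.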
For each pixel of each frequency band, these m functions ƒj(t+δ) of parent samples p1 416, p2 417, p3 418, p4 419 are sampled by
video frames 421, see
In order to ensure phase continuity between the sinusoidal segments, we select the phase shift δ randomly only for the first sinusoidal segment ƒ1(t+δ). For all other functions associated with parent samples we use the current phase δ and the current period T. Nevertheless, due to the variations of the amplitudes, we obtain a non-continuous composite signal. These discontinuities 413a, 413b, 413c appear at the junctions between successive sinusoidal segments (see
To remove the discontinuities at the junction points, we apply a refinement process by using differential values. From the samples of the composite wave, we first calculate the differential values by taking the backward temporal differences: Δvil(x,y)=vil(x,y)−vi-1l(x,y) (
With the blended differential values, we re-calculate the intensity values for each pixel of each band by minimizing the following optimization function:
where n is the total number of frames (
This optimization is solved as a sparse linear system. We obtain a smooth signal (
The deviations from the average Icl(x,y) (
As shown in
A sinusoidal composite wave enables masking the target image both spatially and temporally. However, the visible part, the tempocode video, does not convey any visual meaning. We thus propose to replace the spatial noise with meaningful patterns. For this purpose, we make use of artistic dither matrices which were described in U.S. Pat. No. 7,623,739 to Hersch and Wittwer, herein incorporated by reference.
When printing with bilevel pixels, dithering is used to increase the number of apparent intensities or colors. A full tone color image can be created with spatially distributed surface coverages of cyan (c), magenta (m), yellow (y), and black (k) inks. The human visual system integrates the tiny c,m,y,k inked and non-inked areas into the desired color.
A dither matrix includes a dither threshold value in each of its cells. These dither threshold values indicate at which intensity level pixels should be inked. Artistic dithering enables ordering these threshold levels so that for most levels the turned-on pixels depict a meaningful shape. We adapt artistic dithering to provide a visual meaning to tempocode videos.
We repeat the selected dither matrix (
Instead of finding such a dither input intensity 518, we directly assign white or black to the successive temporal dither threshold levels as follows:
A smooth transition between frames is desirable. Therefore, our expansion function should be continuous. This is ensured by the smooth displacement of the dither matrix.
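The temporal dither expansion of one pixel can be sketched as follows. The rule used here is an assumption made for illustration, since the original assignment equation is not reproduced in this text: the thresholds that a pixel meets over the n frames are ranked, and the frames with the lowest thresholds are set to white until the fraction of white frames matches the target intensity. The function name `temporal_dither` and the threshold values are hypothetical.

```python
def temporal_dither(target, thresholds):
    """Return one bilevel value per frame so that their mean approximates target."""
    n = len(thresholds)
    white_count = round(target * n)  # number of frames that must be white
    order = sorted(range(n), key=lambda i: thresholds[i])  # frames by threshold rank
    frames = [0] * n
    for i in order[:white_count]:
        frames[i] = 1  # white where the threshold is among the lowest
    return frames

# Thresholds met by one pixel while the dither matrix moves (illustrative values).
thresholds = [0.7, 0.1, 0.9, 0.3, 0.5, 0.2, 0.8, 0.4]
frames = temporal_dither(0.375, thresholds)
```

With only n bilevel frames, the reproducible intensities are multiples of 1/n, so the average matches the target up to that quantisation.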
Expansion by simple dithering satisfies one of our conditions, i.e., the average of the frames yields the target image (Eq. (2)). However, a multi-band decomposition cannot be carried out with the dithered binary images since they are bilevel. As shown previously, the multi-band decomposition is an important component for masking the target image. To overcome this problem, we create two parent frames IcP1 and IcP2 (
frames by dither expansion using the temporal dither function as described above. Thanks to the dither expansion we get n dithered frames forming our final video V in which the target image is successfully masked, as shown for a single pixel in
As an example,
The methods for generating tempocodes are described for grayscale target images. For color images, we use exactly the same procedure and apply the self-masking method to each color channel separately.
As a further example,
The present invention introduces a screen camera channel for hiding information by simple averaging. The encoding is complex, but the decoding is very simple. Thus, hidden images can be revealed by non-expert users but not created. The present method does not compete with existing watermarking or steganographic methods that require complex decoding procedures. Rather, it can be used as a first-level secure communication feature. More and more security applications, such as banking software, use smartphones to identify codes that appear on a display. In the present case, instead of directly acquiring the image of a code, the smartphone might acquire a video that incorporates that code. For example, instead of showing a QR code on an electronic document directly, our method can be used to hide it. Hiding a message into a video can be seen as one building block within a larger security framework. Furthermore, tempocodes can be used as video seals in movies against piracy. A video seal can be placed in the credits or titles section (
The final tempocode video is stored on disk 94 or transmitted over the network 96 to another computer in order to be played or to be inserted into a movie. For the display of the tempocode video, a computing system (e.g. TV, laptop, tablet, smartphone, smart watch) with a display 95 is required. The display shows the client's tempocode that has been received through the network or is stored in its memory. Authentication can be performed by an external camera which is not part of this computing system or by another computing system (e.g. laptop, tablet, smartphone) equipped with a digital camera.