The present application claims priority to Chinese Patent Application No. 202210955381.3, entitled “METHOD FOR WATERMARKING DEPTH IMAGE BASED ON MIXED FREQUENCY-DOMAIN CHANNEL ATTENTION”, filed on Aug. 10, 2022 before China National Intellectual Property Administration, the entire contents of which are incorporated herein by reference.
The present disclosure relates to the field of artificial neural networks and digital image watermarking, and in particular to a method for watermarking depth image based on mixed frequency-domain channel attention.
In recent years, with the success of deep neural networks in computer vision tasks, combining deep neural networks with digital image watermarking algorithms has become a popular direction in the field of information hiding. Such a combination not only protects the copyright information of images; because of the powerful learning ability of neural networks, the trained watermarking model can also be applied to most image scenes. In addition, a neural network can fit the embedding and extraction of watermark information well, and allows watermark embedding, image noise and watermark extraction to participate jointly in training, which improves robustness and invisibility compared with traditional methods. The selection of channel features plays a role in image watermarking, and selecting frequency-domain components suitable for watermark embedding as the weights of channel features in a frequency-domain channel attention module can improve the performance of the watermark model. However, with current methods, watermark extraction from watermarked images after JPEG compression is not effective, and the quality of the watermarked images is poor.
To solve the problems of poor watermarked image quality and poor watermark extraction after JPEG compression, the present disclosure provides a method for watermarking depth image based on mixed frequency-domain channel attention. The method combines an end-to-end deep watermark model with frequency-domain channel attention to expand the application range of deep neural networks in the field of image watermarking, designs a new encoder structure with the help of a frequency-domain channel attention module, and finally obtains a watermarked image of higher quality and watermark information that decodes better.
The technical solution adopted by the present disclosure to solve the technical problem thereof is described in the following content:
A method for watermarking depth image based on mixed frequency-domain channel attention, comprising the following steps of:
Further, step 1 is specifically that the watermark information processor takes the watermark information as an input, diffuses the watermark information to each bit of information through a fully connected layer, transforms the diffused watermark information from a one-dimensional feature map form to a two-dimensional feature map form, and then generates a watermark information feature map through a diffusion convolution layer and an attention module.
Further, step 2 is specifically that the encoder takes the carrier image and the watermark information feature map as an input, and generates the watermarked image through a ConvBNReLU convolution block, a mixed frequency-domain channel attention module and a skip connection.
Furthermore, the mixed frequency-domain channel attention module in the encoder is composed of two branches. One branch is composed of a plurality of SENet attention modules, which use a global average pooling layer in the channel compression process, namely, take the lowest frequency component of the two-dimensional discrete cosine transform as the weight allocated to a channel feature. The other branch is composed of an FCA attention module, which generates 64 frequency-domain components divided according to the 8×8 block mode of the JPEG compression principle, and selects 16 low-frequency components as its compression weights in zigzag order starting from the lowest frequency component. The feature tensors generated by the FCA branch and the SENet branch are then spliced in the channel dimension, and a ConvBNReLU convolution module is used for feature fusion.
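The selection of 16 low-frequency components from the 8×8 grid can be illustrated with a short Python sketch. It assumes the conventional JPEG zigzag scan, which moves first to the right of the lowest frequency position and then down; the function name `zigzag_indices` and the (row, column) index convention are illustrative assumptions rather than part of the disclosure.

```python
def zigzag_indices(n=8):
    """Return the (row, col) positions of an n x n block in JPEG zigzag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even anti-diagonals are traversed from bottom-left to top-right,
        # odd ones from top-right to bottom-left.
        if s % 2 == 0:
            diagonal.reverse()
        order.extend(diagonal)
    return order


# The 16 lowest-frequency positions out of the 64 available components.
low_freq = zigzag_indices(8)[:16]
print(low_freq)
```

Under this scan order the selection covers the first five anti-diagonals completely, plus the first position of the sixth.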
Further, step 4 is specifically that the decoder takes the noise image as an input, and uses the ConvBNReLU convolution module and the SENet attention module to perform down-sampling to recover the watermark information.
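Further, a loss function for training the encoder includes $L_{E1}$ and $L_{E2}$:

$$L_{E1}=\mathrm{MSE}(I_{CO},I_{EN})=\mathrm{MSE}(I_{CO},E(\theta_E,I_{CO},M_{EN}))$$

$$L_{E2}=\log(A(\theta_A,I_{EN}))=\log(A(\theta_A,E(\theta_E,I_{CO},M_{EN})))$$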
wherein $I_{CO}$ is the carrier image, $I_{EN}$ is the watermarked image, E represents the encoder, $\theta_E$ is a parameter of the encoder E, $M_{EN}$ is the watermark information feature map; A represents an adversarial discriminator, and $\theta_A$ is a parameter of the adversarial discriminator A.
Further, a loss function $L_D$ for training the decoder is:

$$L_D=\mathrm{MSE}(M,M_D)=\mathrm{MSE}(M,D(\theta_D,I_{NO}))$$

wherein M is the original watermark information, $M_D$ is the decoded and recovered watermark information, D represents the decoder, $\theta_D$ is a parameter of the decoder D, and $I_{NO}$ is the noise image.
Further, a loss function $L_A$ for training the adversarial discriminator is:

$$L_A=\log(1-A(\theta_A,E(\theta_E,I_{CO},M_{EN})))+\log(A(\theta_A,I_{CO}))$$

wherein A represents the adversarial discriminator, $\theta_A$ is a parameter of the adversarial discriminator A, E represents the encoder, $\theta_E$ is a parameter of the encoder E, $I_{CO}$ is the carrier image, and $M_{EN}$ is the watermark information feature map.
The technical solution adopted by the present disclosure has advantages compared with the prior art.
Channel attention is introduced to extract features of the carrier image, a plurality of frequency-domain components in the channel are used to reduce the amount of information lost in the encoding process, and 16 low-frequency components are independently selected as the weighting parameters of the channel attention, since they are more robust to JPEG compression than the middle-frequency and high-frequency components.
A structure of two branches is designed. The two branches use different attentions to learn the feature map. The feature maps generated by the two branches are spliced in the channel dimension and then fused by the convolution layer so that the quality of the generated watermarked image is greatly improved.
The technical solution of the present disclosure will now be described more clearly and fully hereinafter with reference to the accompanying drawings, in which embodiments of the disclosure are shown. It is to be understood that the embodiments described are only a few, but not all embodiments of the disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by a person of ordinary skill in the art without inventive effort fall within the scope of the present disclosure.
The foregoing content is described in detail in the following text.
The watermark information processor is mainly responsible for processing the watermark information and inputting the processed feature map into the encoder. It receives binary watermark information of length L composed of zeros and ones and outputs a watermark information feature map of size C′×H×W, where C′ is the number of channels of the feature map, H is the height of the feature map and W is the width of the feature map. In particular, randomly generated watermark information of length L is reshaped from one dimension to two dimensions, giving a feature map in $\{0,1\}^{1\times h\times w}$, where L=h×w. It is then amplified by a ConvBNReLU convolution module, consisting of a convolution layer with a convolution kernel size of 3×3, a batch normalization layer and a ReLU activation function, and its size is extended to C×H×W by several diffusion convolution layers. Finally, in order to expand the information more appropriately, the feature map of the watermark information is extracted by several SE attention modules.
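The reshaping of the one-dimensional watermark bits into a single-channel two-dimensional map can be sketched in plain Python as follows. The sizes L=64 and h=w=8, the helper name `bits_to_map` and the fixed random seed are illustrative assumptions; the text only requires L=h×w. The subsequent convolution and attention layers are not modelled here.

```python
import random


def bits_to_map(bits, h, w):
    """Reshape a flat list of L = h * w bits into a 1 x h x w nested list."""
    if len(bits) != h * w:
        raise ValueError("watermark length L must equal h * w")
    rows = [bits[r * w:(r + 1) * w] for r in range(h)]
    return [rows]  # the outer list is the single channel dimension


rng = random.Random(0)               # fixed seed so the sketch is repeatable
L, h, w = 64, 8, 8                   # hypothetical sizes satisfying L = h * w
message = [rng.randint(0, 1) for _ in range(L)]
message_map = bits_to_map(message, h, w)
```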
The encoder E with parameter $\theta_E$ takes as inputs an RGB color image of size 3×H×W, i.e. the carrier image $I_{CO}$, and the watermark information feature map $M_{EN}$, and outputs an encoded image of size 3×H×W, i.e. the watermarked image $I_{EN}$. To better select the channel features, the encoder uses a mixed frequency channel attention module comprising a plurality of SE channel attention modules and an FCA frequency-domain channel attention module. The principle for the FCA attention module to select multi-frequency components is:
where $b_{u,v}^{i,j}$ is a basis function of the discrete cosine transform, from which some constant coefficients are removed without affecting the result, $x^{2d}$ is taken as the input of the discrete cosine transform, H is the height of $x^{2d}$, W is the width of $x^{2d}$, and $u\in\{0,1,\ldots,H-1\}$, $v\in\{0,1,\ldots,W-1\}$. The global average pooling operation is actually equivalent to the discrete cosine transform value when u=0 and v=0, i.e. the lowest frequency component:
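Assuming the basis function takes the standard two-dimensional DCT form used by frequency-domain channel attention, $b_{u,v}^{i,j}=\cos\left(\frac{\pi u}{H}\left(i+\frac{1}{2}\right)\right)\cos\left(\frac{\pi v}{W}\left(j+\frac{1}{2}\right)\right)$, each frequency component is $\sum_{i}\sum_{j}x^{2d}_{i,j}\,b_{u,v}^{i,j}$, and the component at u=0, v=0 equals H·W times the global average of $x^{2d}$. The Python sketch below checks this numerically on a toy channel; the function names and the toy input are illustrative assumptions, not part of the disclosed model.

```python
import math


def dct_component(x, u, v):
    """Unnormalised 2D DCT coefficient of the H x W grid x at frequency (u, v)."""
    H, W = len(x), len(x[0])
    total = 0.0
    for i in range(H):
        for j in range(W):
            basis = (math.cos(math.pi * u * (i + 0.5) / H)
                     * math.cos(math.pi * v * (j + 0.5) / W))
            total += x[i][j] * basis
    return total


def global_average_pool(x):
    """Mean of all entries of the H x W grid x."""
    H, W = len(x), len(x[0])
    return sum(sum(row) for row in x) / (H * W)


# Toy 3 x 4 channel with entries 0..11.
x = [[float(i * 4 + j) for j in range(4)] for i in range(3)]
lowest = dct_component(x, 0, 0)
pooled = global_average_pool(x)
print(lowest, 3 * 4 * pooled)  # the two values agree
```

Because the two quantities differ only by the constant factor H·W, a channel weight learned on top of either carries the same information, which is why the SE branch can be read as a single-frequency special case of the FCA branch.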
The whole encoder consists of a plurality of ConvBNReLU convolution blocks with a convolution kernel size of 3×3, a mixed frequency channel attention module and a convolution layer with a convolution kernel size of 1×1. In the first step, the encoder expands the carrier image through a ConvBNReLU convolution block with a convolution kernel size of 3×3, then applies the proposed mixed frequency channel attention module, which keeps the feature map size unchanged, and uses a ConvBNReLU convolution block with a convolution kernel size of 3×3 to aggregate the feature maps obtained by the attention module. In the second step, the watermark information feature map obtained from the watermark information processor, the carrier image, and the feature map obtained by the mixed frequency channel attention module are input into a ConvBNReLU convolution block with a convolution kernel size of 3×3 for feature fusion. In the third step, the fused feature map and the carrier image passed through the skip connection are spliced into a new feature map and sent to a convolution layer with a convolution kernel size of 1×1 to obtain the encoded image $I_{EN}$. The encoder is trained to minimize the L2 distance between $I_{CO}$ and $I_{EN}$ by updating the parameter $\theta_E$:
$$L_{E1}=\mathrm{MSE}(I_{CO},I_{EN})=\mathrm{MSE}(I_{CO},E(\theta_E,I_{CO},M_{EN}))$$
The robustness of the overall model is provided by the noise layer. The noise layer takes the encoded image $I_{EN}$ as input and outputs a noise image $I_{NO}$ of the same size, with the noise selected from a specified noise pool. During training, for each batch of input encoded images, the noise layer randomly selects one of the preset noises to apply as distortion, simulating the noise environment of a real scene.
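The per-batch random choice made by the noise layer can be sketched as follows. The three distortions, their parameters and the flat-list image representation are hypothetical placeholders chosen to keep the example self-contained; the actual noise pool is whatever set of distortions is specified for training.

```python
import random


def identity(batch, rng):
    """No distortion: return a copy of every image."""
    return [list(image) for image in batch]


def gaussian_noise(batch, rng, sigma=0.05):
    """Add zero-mean Gaussian noise to every pixel."""
    return [[p + rng.gauss(0.0, sigma) for p in image] for image in batch]


def pixel_dropout(batch, rng, keep=0.7):
    """Zero each pixel with probability 1 - keep."""
    return [[p if rng.random() < keep else 0.0 for p in image] for image in batch]


NOISE_POOL = [identity, gaussian_noise, pixel_dropout]  # hypothetical pool


def noise_layer(encoded_batch, rng):
    """Pick one distortion from the pool and apply it to the whole batch."""
    distortion = rng.choice(NOISE_POOL)
    return distortion(encoded_batch, rng), distortion.__name__


rng = random.Random(0)
batch = [[0.5] * 12 for _ in range(4)]   # 4 toy images of 12 pixels each
noised, chosen = noise_layer(batch, rng)
```

Every distortion returns a batch with the same shape as its input, matching the requirement that the noise image has the same size as the encoded image.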
The task of the decoder D with parameter $\theta_D$ is to recover the watermark information $M_D$ of length L from the noise image $I_{NO}$; this part determines the ability of the whole model to extract the watermark. In the decoding stage, the noise image $I_{NO}$ is input to a ConvBNReLU layer with a convolution kernel size of 3×3, and the obtained feature map is downsampled by a number of SE attention modules. Then, the multi-channel tensor is converted into a single-channel tensor through a convolution layer with a convolution kernel size of 3×3, and the single-channel tensor is reshaped to obtain the decoded watermark information $M_D$. The goal of training the decoder is to minimize the L2 distance between the original watermark information M and $M_D$ by updating the parameter $\theta_D$:
$$L_D=\mathrm{MSE}(M,M_D)=\mathrm{MSE}(M,D(\theta_D,I_{NO}))$$
Since the loss function $L_D$ plays an important role in the bit error rate index, it occupies the largest proportion of the total loss function.
The adversarial discriminator A is composed of a plurality of ConvBNReLU modules with a convolution kernel size of 3×3 and a global average pooling layer. Under the influence of the adversarial network, the encoder tries to deceive the discriminator as much as possible, so that the discriminator cannot correctly distinguish $I_{CO}$ from $I_{EN}$, and updates the parameter $\theta_E$ to minimize the loss function $L_{E2}$, so as to improve the encoding quality of the encoder:
$$L_{E2}=\log(A(\theta_A,I_{EN}))=\log(A(\theta_A,E(\theta_E,I_{CO},M_{EN})))$$
The discriminator with parameter $\theta_A$ acts as a binary classifier that needs to distinguish $I_{CO}$ from $I_{EN}$. The goal of the discriminator is to minimize the classification loss $L_A$ by updating $\theta_A$:
$$L_A=\log(1-A(\theta_A,E(\theta_E,I_{CO},M_{EN})))+\log(A(\theta_A,I_{CO}))$$
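The two adversarial terms can be evaluated from discriminator outputs with the short Python sketch below. It assumes that the discriminator output lies strictly between 0 and 1 (for example after a sigmoid) and that each loss is averaged over the batch; neither assumption is stated in the text, and the function names are illustrative.

```python
import math


def encoder_adversarial_loss(d_on_encoded):
    """Batch mean of log(A(I_EN)): the adversarial term minimised by the encoder."""
    return sum(math.log(p) for p in d_on_encoded) / len(d_on_encoded)


def discriminator_loss(d_on_encoded, d_on_cover):
    """Batch mean of log(1 - A(I_EN)) + log(A(I_CO)): minimised by the discriminator."""
    terms = [math.log(1.0 - p_en) + math.log(p_co)
             for p_en, p_co in zip(d_on_encoded, d_on_cover)]
    return sum(terms) / len(terms)


# An undecided discriminator versus a confident one.
undecided = discriminator_loss([0.5], [0.5])
confident = discriminator_loss([0.9], [0.1])
print(undecided, confident)
```

Read this way, the discriminator loss decreases as the output approaches 1 on watermarked images and 0 on carrier images, so the output behaves like the probability that an image is watermarked, while the encoder term decreases as the output on watermarked images approaches 0.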
The loss function design has two parts. One part is the total loss function for the encoder and the decoder, $L=\lambda_E L_{E1}+\lambda_D L_D+\lambda_A L_{E2}$, and the other part, $L_A$, is the loss function for the adversarial discriminator. $\lambda_E$, $\lambda_D$ and $\lambda_A$ are the weight parameters of the respective loss functions, set to 1, 10 and 0.0001 in training, respectively.
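As a rough illustration of how the weighted terms combine, the following Python sketch computes the total encoder and decoder loss from flat lists of pixel and bit values. The weights 1, 10 and 0.0001 are those stated in the text; the helper names, the flat-list representation and the toy numbers are assumptions made for the example.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


LAMBDA_E, LAMBDA_D, LAMBDA_A = 1.0, 10.0, 0.0001  # weights stated in the text


def total_loss(cover, encoded, message, decoded, l_e2):
    """Weighted sum of image distortion, message error and adversarial term."""
    l_e1 = mse(cover, encoded)     # carrier image versus watermarked image
    l_d = mse(message, decoded)    # original versus recovered watermark
    return LAMBDA_E * l_e1 + LAMBDA_D * l_d + LAMBDA_A * l_e2


# Toy values: l_e1 = 1.0, l_d = 0.5, adversarial term = -1.0.
example = total_loss([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0], -1.0)
print(example)
```

With these weights the message recovery term dominates, the image distortion term comes second, and the adversarial term acts only as a small regulariser.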
To reflect the universality of the model, 10000 images may be randomly selected from the ImageNet data set as the training set, and 5000 images each may be randomly selected from the COCO data set as the verification set and the test set. Before being input into the model for training, the images are pre-processed and cropped to a size of 128×128; the batch size is set to 16 and the number of training rounds is set to 150. The dynamic Adam optimization algorithm is selected for training, with a learning rate of 0.001. For the test of JPEG compression noise, the library function provided in PIL may be used. During training, the embedding strength of the watermark information is set to 1. To measure the performance of the watermarking algorithm, PSNR and SSIM are used to calculate the similarity between the carrier image and the watermarked image, which represents the imperceptibility of the algorithm, and the bit error rate between the original watermark information and the watermark information recovered by the decoder is used to represent the robustness of the algorithm.
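Two of the metrics named above, PSNR and the bit error rate, can be computed with the following Python sketch. It assumes 8-bit pixel values (peak 255) and a decision threshold of 0.5 on the decoder output, both of which are illustrative choices; SSIM is omitted because it needs windowed statistics that do not fit in a short example.

```python
import math


def psnr(cover, marked, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    err = sum((c - m) ** 2 for c, m in zip(cover, marked)) / len(cover)
    if err == 0:
        return float("inf")   # identical images
    return 10.0 * math.log10(peak ** 2 / err)


def bit_error_rate(sent, recovered, threshold=0.5):
    """Fraction of bits that differ after thresholding the decoder output."""
    decoded = [1 if r >= threshold else 0 for r in recovered]
    errors = sum(1 for s, d in zip(sent, decoded) if s != d)
    return errors / len(sent)


quality = psnr([0, 0, 0, 0], [1, 1, 1, 1])                    # every pixel off by one level
ber = bit_error_rate([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1])      # third bit is wrong
print(quality, ber)
```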
Other methods are also tested after training with JPEG compression noise. See Table 1 for the relevant data.
Both single-noise and mixed-noise model settings are trained. In the single-noise model, the noise layer includes only one kind of noise, and the trained watermark model is strongly robust only to that noise. Taking JPEG compression as an example, the noise layer is set to no noise, simulated JPEG-Mask and real JPEG compression. The reason for this selection is that real JPEG compression is non-differentiable, so it cannot feed gradients back to update the model parameters during training, while the simulated JPEG-Mask is only a manually set JPEG compression template that cannot reproduce the effect of real JPEG compression. Therefore no noise, JPEG-Mask and real JPEG compression are combined in hybrid training to simulate real JPEG compression as closely as possible, with the intensity factor of JPEG compression set to 50.
Selection of weights. The preset number of training rounds is 150. After training is completed, the several training rounds with the lowest total loss on the verification set are identified from the recorded training logs, and their weights are loaded into the model for testing.
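The selection rule can be expressed in a few lines of Python. The log values below are made up for illustration, and the number of rounds to keep (`k`) is a free parameter, since the text only says that several rounds are selected.

```python
def select_checkpoints(val_losses, k=3):
    """Return the k training rounds (1-based) with the lowest validation loss."""
    ranked = sorted(range(len(val_losses)), key=lambda e: val_losses[e])
    return [e + 1 for e in ranked[:k]]


# Hypothetical total validation loss recorded after each of five rounds.
log = [0.91, 0.40, 0.35, 0.37, 0.52]
best_rounds = select_checkpoints(log, k=2)
print(best_rounds)
```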
Methods for testing. It is emphasized that the watermarked image used in testing differs from that used in training. In training, the watermarked image generated by the encoder is directly input into the noise layer to participate in the whole training. In testing, the weight parameters of the watermark information processor, the encoder and the decoder are fixed; the difference $I_{diff}$ between the watermarked image generated by the encoder and the carrier image represents the watermark information; $I_{diff}$ is multiplied by the watermark embedding strength α and then added to the carrier image pixel by pixel to generate the watermarked image for testing, namely $I_{EN}=I_{CO}+\alpha\times I_{diff}=I_{CO}+\alpha\times(I_{EN}-I_{CO})$. The strength factor α is 1 during training, and can be adjusted during testing to balance robustness and invisibility for different applications. After the test parameters are set, the previously selected training weights are loaded for the test, and the results over the images in the test set are averaged to represent the overall performance.
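The test-time rescaling of the encoder residual amounts to one line of arithmetic per pixel, sketched here in Python on flat lists of pixel values. The function name and the toy values are illustrative assumptions; any clipping to a valid pixel range is not described in the text and is therefore left out.

```python
def apply_strength(cover, encoded, alpha):
    """Return cover + alpha * (encoded - cover), computed pixel by pixel."""
    return [c + alpha * (e - c) for c, e in zip(cover, encoded)]


cover = [0.2, 0.5]
encoded = [0.3, 0.4]                            # hypothetical encoder output
same = apply_strength(cover, encoded, 1.0)      # alpha = 1 reproduces the encoder output
stronger = apply_strength(cover, encoded, 2.0)  # doubles the residual
print(same, stronger)
```

Values of α above 1 enlarge the residual, which favours robustness at the cost of visibility, while values below 1 do the opposite.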
Table 3 shows the results of comparing the encoded image quality after a single training for each noise, with the intensity factor adjusted so that the bit error rate approaches 0%.
Table 4 shows the results of tests at different quality factors and different intensity factors after training specifically on JPEG compression noise.
It is to be understood that the above-described embodiments are merely illustrative for clarity and are not restrictive of the embodiments. It will be apparent to those skilled in the art that various other modifications and variations can be made in the present disclosure without departing from the scope or spirit of the disclosure. All embodiments need not be, and cannot be, exhaustive. Obvious modifications or variations are possible in light of the above-mentioned teachings.
Number | Date | Country | Kind |
---|---|---|---|
2022109553813 | Aug 2022 | CN | national |
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2023/101599 | Jun 2023 | US |
Child | 18453846 | | US |