REVERSIBLE DNA INFORMATION HIDING METHOD BASED ON PREDICTION-ERROR EXPANSION AND HISTOGRAM SHIFTING

CROSS REFERENCE TO RELATED APPLICATION

The present application claims priority to Korean Patent Application No. 10-2018-017337, filed Feb. 13, 2018, which is incorporated herein by reference.

TECHNICAL FIELD

The present invention relates generally to a reversible DNA information hiding method based on prediction-error expansion and histogram shifting, the method being capable of false start codon prevention, original sequence length preservation, high watermark capacity, and blind detection without biological mutation.

RELATED ART

A DNA sequence consists of a coding DNA and a non-coding DNA, and watermarks are inserted into the two regions, respectively, such that data can be hidden. In the case of the coding DNA, a redundancy codon range is extremely small, and thus the coding DNA is not suitable for reversible watermarking. In the case of the non-coding DNA, a watermark available range is wide compared to the coding DNA due to no condition for protein code preservation, and thus the non-coding DNA is suitable for DNA reversible watermarking.

Lossless compression and difference expansion (DE)-based methods widely used in conventional reversible image watermarking have been proposed by T. Chen, et al. (reference [1]). A histogram-based reversible DNA watermarking method with a low modification rate of bases has been proposed by Huang, et al. (reference [2]). In this method, the modification rate of bases is low, but the number of embedded bits per base (bpn) is extremely low and a false start codon occurs, as in Chen's method.

Furthermore, a piecewise linear chaotic map (PWLCM)-based information hiding method has been proposed by Liu, et al. (reference [3]). Information hiding methods for tamper location detection and restoration of a DNA sequence have been proposed by J. Fu (reference [4]) and Ma (reference [5]). These methods hide data using substitution by the complementary rule, and are non-blind methods requiring a reference (or original) DNA sequence for extraction and restoration.

The foregoing is intended merely to aid in the understanding of the background of the present invention, and is not intended to mean that the present invention falls within the purview of the related art that is already known to those skilled in the art.

SUMMARY

Accordingly, the present invention has been made keeping in mind the above problems occurring in the related art, and the present invention is intended to propose a reversible DNA information hiding method based on prediction-error expansion and histogram shifting, the method being capable of false start codon prevention, original sequence length preservation, high watermark capacity, and blind detection without biological mutation.

In order to achieve the above object, according to one aspect of the present invention, there is provided a reversible DNA information hiding method based on prediction-error expansion and histogram shifting, the method including: coding, at a first step, a four-letter base sequence of a non-coding region DNA to an n order code value; embedding, at a second step, multiple bits for each code value by a least squares (LS) prediction error; embedding, at a third step, an n order watermark bit by non-circular histogram and circular histogram multi-level shifting; and verifying, at a fourth step, occurrence of a false start codon in a watermarked intra code value and in watermarked inter code values.

At the first step, $\mathbf{b}$ may be a four-letter base $\mathbf{b}$={‘A’, ‘T’, ‘C’, ‘G’}, b may be a base value of $\mathbf{b}$, $\mathbf{x}$ may be a base block consisting of n bases, x may be a code value for the base block $\mathbf{x}$, and n may be a coding order. Coding to a 2n-bit code value x in units of the base block $\mathbf{x}$ consisting of the n bases may be performed as follows:

$x = f(\mathbf{x}) = \sum_{k = 1}^{n} (b_{k} \cdot 2^{2 (n - k)})$

where $\mathbf{x}$=(b₁, b₂, . . . , b_n) and x∈[0, 2²ⁿ−1]. The bases of the base block may be restored from the code value x as follows: f⁻¹(x)=$\mathbf{x}$, where b_k=(x>>2(n−k))%4 for k=1, . . . , n.

At the fourth step, preventing of a false start codon in the watermarked intra code value may include: generating a code value table containing the false start codon in advance; and embedding a watermarked code value such that it is not contained in the code value table.

At the fourth step, preventing of a false start codon in the watermarked inter code values may include: when a previous watermarked code value x′_{i−1} is given, controlling a number of embedded bits for a current processed code value such that the current processed code value x′_i does not satisfy

x′_{i−1}(n−1,n)∥x′_i(1,2)∈Z^c

if (x′_{i−1}%2⁴)=f(‘AT’)=1 and (x′_i>>2(n−1))%2²=f(‘G’)=3, or

if (x′_{i−1}%2²)=f(‘A’)=0 and (x′_i>>2(n−2))%2⁴=f(‘TG’)=7.

At the second step, the code value may be predicted through local prediction for each embedding region.

According to the reversible DNA information hiding method based on prediction-error expansion and histogram shifting of the present invention, false start codon prevention, original sequence length preservation, high watermark capacity, and blind detection are possible without biological mutation.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objects, features and other advantages of the present invention will be more clearly understood from the following detailed description when taken in conjunction with the accompanying drawings, in which:

FIGS. 1A and 1B are views illustrating a general 2-bit base value and a 2n-bit value for n order base blocks, respectively;

FIGS. 2A and 2B are views illustrating occurrence probability of a false start codon in an intra code value and in inter code values, respectively;

FIGS. 3A and 3B are views illustrating, with respect to the coding order n with a prediction order p=1, a ratio R_region(n) of the number of embedding regions and a ratio R_base(n) of the number of bases, and a code value level and the number of code values when the number of bases is 100, respectively;

FIGS. 4A and 4B are views illustrating an expandable region of x for a prediction value x̂, and the number of expandable bits of x with the prediction value x̂=0, 128, 255, where $α (k) = sgn (d) \sum_{j = 1}^{k} 2^{j - 1} w_{j}$, when all watermark bits have values of one, w={1}₁^{2n−1};

FIGS. 5A and 5B are views illustrating code values of ‘AE017199’ and ‘CP000473.1’ sequences, histograms of the code values, successive predictor difference histograms when the coding orders are n=3 and n=4;

FIGS. 6A and 6B are views illustrating prediction error histograms of LS predictors, mean predictors, and successive predictors of the ‘AE017199’ and ‘CP000473.1’ sequences when the coding orders are n=3 and n=4;

FIG. 7 is a view illustrating shift of values where differences from a center value R_i are d>0 and d<0 on an arbitrary section P_i of an n order code value histogram domain Z;

FIGS. 8A and 8B are views illustrating code value shifting on a current section P_i and left and right adjacent sections P_{i−1} and P_{i+1}, and code value shifting between each section and left and right adjacent sections on the entire sections; and

FIG. 9 is a view illustrating data hiding based on circular histogram shifting.

DETAILED DESCRIPTION

According to a preferred embodiment of the present invention, a reversible DNA information hiding method based on prediction-error expansion and histogram shifting is a method using difference expansion (DE) of a multi-bit base code value and histogram shifting, and main features of the present invention are as follows.

1. Blind Reversibility: a reversible watermark is hidden without change in the length of a DNA sequence and in amino acid, and extraction and restoration are possible without an original DNA sequence.

2. Watermarking Usability: a base sequence of 2-bit values is encoded to a code value sequence of 2n bits, such that reversible watermark hiding, extraction, and restoration processes are easily performed.

3. Watermark Capacity: based on DE and histogram shifting of a code value sequence, multi-bit embedding for each target code value is enabled, and thus watermark capacity is increased.

4. No false start codon: through a false start codon—code value table and comparison-search between adjacent code values, occurrence of a false start codon in an intra code value and inter code values is prevented.

Before description of the present invention, symbols used in the present invention are defined as follows.

- A DNA sequence consists of a non-coding region D^nc and a coding region D^c.
- The non-coding region D^nc is divided into an embedding region Γ and a non-embedding region Γ^c=D^nc−Γ.
- An embedding target region Γ has |Γ| regions D_i, and each region D_i consists of |D_i| bases; Γ={D_i}_{i=1}^{|Γ|}, D_i={b_j}_{j=1}^{|D_i|}.
- $\mathbf{b}$ is a four-letter symbol base $\mathbf{b}$={‘A’, ‘T’, ‘C’, ‘G’}, and b is a base value of $\mathbf{b}$.
- $\mathbf{x}$={b₁, b₂, . . . , b_n} is a base block consisting of n bases, and x is a code value for the base block $\mathbf{x}$. Here, n is called a coding order.
- x′ is a watermarked code value, and $\mathbf{x}'$={b′₁, b′₂, . . . , b′_n} is a base block of x′.
- W={w₁, w₂, . . . , w_{N_w}}, w∈{0,1} is a watermark bit string to be hidden.

Cardinality |D| of a set D indicates the number of elements or length of D.

1. Coding of Four-Letter Base

For ease of watermarking signal processing on a four-letter base sequence, multi-bit coding processing is essential. In this section, the multi-bit coding processing for ease of watermarking signal processing and false start codon prevention will be described.

1-1. Coding Based on a Coding Order

Generally, a nucleotide base is expressed as four letters, b=(A, T, C, G) as shown in FIG. 1A, that are expressed as four decimal numbers or 2-bit binary numbers.

b=(0,1,2,3)₁₀=(00,01,10,11)₂←b=(A,T,C,G) (1)

For ease of signal processing, rather than a 2-bit value, expansion to a value expressed in multiple bits, as shown in FIG. 1B, is required. In the present invention, coding to a 2n-bit code value x in units of a base block $\mathbf{x}$ consisting of n bases is performed as follows.

$\begin{matrix} x = f (\mathbf{x}) = \sum_{k = 1}^{n} (b_{k} \cdot 2^{2 (n - k)}), \text{ where } \mathbf{x} = (b_{1}, b_{2}, \dots, b_{n}),\ x \in [0, 2^{2 n} - 1] & (2) \end{matrix}$

The bases of the base block are easily restored from the code value x as follows.

f⁻¹(x)=$\mathbf{x}$, where b_k=(x>>2(n−k))%4 for k=1, . . . , n (3)

In the present invention, the number n of bases of the base block is called a coding order. Bases in the embedding region D_i are coded to a code value sequence X_i based on the coding order n; X_i={x_k|k∈[1,N_i]}, N_i=⌊|D_i|/n⌋. Here, the number N_i of code values is determined by the coding order n.
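Formulas (2) and (3) can be illustrated with a short Python sketch. The function names (`encode_block`, `decode_value`, `encode_region`) are hypothetical, and ignoring trailing bases that do not fill a whole block is an assumption inferred from N_i=⌊|D_i|/n⌋; the specification does not say how such bases are handled.

```python
# Base values from Formula (1): A=0, T=1, C=2, G=3.
BASE_VALUE = {"A": 0, "T": 1, "C": 2, "G": 3}
VALUE_BASE = "ATCG"


def encode_block(block):
    """Formula (2): code an n-base block to a 2n-bit code value."""
    n = len(block)
    return sum(BASE_VALUE[b] << (2 * (n - k)) for k, b in enumerate(block, start=1))


def decode_value(x, n):
    """Formula (3): restore the n bases of the block from the code value."""
    return "".join(VALUE_BASE[(x >> (2 * (n - k))) % 4] for k in range(1, n + 1))


def encode_region(bases, n):
    """Code a region block by block; leftover bases are ignored (assumption)."""
    count = len(bases) // n
    return [encode_block(bases[i * n:(i + 1) * n]) for i in range(count)]
```

For example, with n=3 the block ‘ATG’ is coded to 0·16+1·4+3=7, and `decode_value(7, 3)` returns ‘ATG’ again.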

1-2. False Start Codon Prevention

The false start codon may occur in an intra code value or inter code values as follows.

1) Intra Code Value

A code value domain based on the coding order n is z∈Z=[0, 2²ⁿ−1]. In the case of n>2, as shown in FIG. 2A, a false start codon may occur at n−2 positions in the base block. The number of code values containing a false start codon at an arbitrary position j∈[1,n−2] in the base block is 2^{2(n−3)}, and thus the total number of code values containing false start codons over the n−2 positions is (n−2)×2^{2(n−3)}. The code value z^c containing the false start codon is defined as follows.

$\begin{matrix} z^{c} = \sum_{k = 1}^{j - 1} b_{k} 2^{2 (n - k)} + 0 \times 2^{2 (n - j)} + 1 \times 2^{2 (n - j - 1)} + 3 \times 2^{2 (n - j - 2)} + \sum_{k = j + 3}^{n} b_{k} 2^{2 (n - k)} & (4) \end{matrix}$

for ∀j∈[1,n−2] and ∀b_k∈{A,T,C,G}, k=1, 2, . . . , j−1, j+3, . . . , n

Here, the symbols ‘A’, ‘T’, and ‘G’ correspond to 0, 1, and 3 as shown in Formula (1), and except for the consecutive bases {A,T,G} at the arbitrary position j, the bases at all remaining positions may be any of {A, T, C, G}. According to the present invention, in coding of the base, a code value table Z^c={z^c} including the false start codon is generated in advance, and then the embedding process is performed such that a watermarked code value x′ is not included in Z^c.
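The table Z^c can be sketched in Python by enumerating all n order code values and keeping those whose base block contains ‘ATG’. The function name `false_start_table` is hypothetical, and brute-force enumeration is chosen here only for clarity. Because the result is a set, a code value containing ‘ATG’ at two positions (possible for n≥6) is stored once, so the set can be smaller than the positional count (n−2)×2^{2(n−3)}.

```python
def false_start_table(n):
    """Return the set of n-order code values whose base block contains 'ATG'."""
    table = set()
    for x in range(4 ** n):
        # Decode the code value to its base block (A=0, T=1, C=2, G=3).
        block = "".join("ATCG"[(x >> (2 * (n - k))) % 4] for k in range(1, n + 1))
        if "ATG" in block:
            table.add(x)
    return table
```

For n=3 the table holds the single value 7 (‘ATG’); for n=4 it holds eight values, matching (4−2)×2².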

2) Inter Code Values

The false start codon may occur between the base block of a previous watermarked code value x′_{i−1} and the base block of a current processed code value x′_i. As shown in FIG. 2B, in the case of (x′_{i−1}, x′_i), when the bases are ( . . . A, TG . . . ) or ( . . . AT, G . . . ), the false start codon occurs across the boundary between the two blocks. Thus, two code values including the false start codon therebetween are defined as follows.

x′_{i−1}(n−1,n)∥x′_i(1,2)∈Z^c (5)

if (x′_{i−1}%2⁴)=f(‘AT’)=1 and (x′_i>>2(n−1))%2²=f(‘G’)=3, or

if (x′_{i−1}%2²)=f(‘A’)=0 and (x′_i>>2(n−2))%2⁴=f(‘TG’)=7.

x(j,j+1) indicates the j-th and (j+1)-th bases of the code value x, and ∥ indicates a concatenation operator. x′_{i−1}(n−1,n)∥x′_i(1,2) indicates a code value where the (n−1)-th and n-th bases of x′_{i−1} are concatenated with the first and second bases of x′_i. In the present invention, when the previous watermarked code value x′_{i−1} is provided, the number of embedded bits for the code value x_i is controlled to prevent the current watermarked code value x′_i from satisfying the above condition.
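The two boundary conditions of Formula (5) translate directly into integer tests. The following Python sketch uses a hypothetical function name and assumes n≥2, since the test needs at least two bases per block.

```python
def inter_false_start(prev_marked, cur_marked, n):
    """True if 'ATG' spans the boundary between two adjacent n-base code values."""
    if n < 2:
        raise ValueError("the boundary test assumes a coding order n >= 2")
    # Previous block ends with 'AT' (value 1) and current block starts with 'G' (3).
    at_then_g = (prev_marked % 16 == 1) and ((cur_marked >> (2 * (n - 1))) % 4 == 3)
    # Previous block ends with 'A' (0) and current block starts with 'TG' (7).
    a_then_tg = (prev_marked % 4 == 0) and ((cur_marked >> (2 * (n - 2))) % 16 == 7)
    return at_then_g or a_then_tg
```

With n=3, the pair ‘CAT’ (33) and ‘GAA’ (48) triggers the first condition, and ‘CCA’ (40) followed by ‘TGC’ (30) triggers the second.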

2. Embedding Region (Target Region) Selection

In the present invention, a watermark is embedded into a code value string generated in units of a base block. Here, a region with a short sequence length is not suitable for a watermark embedding target due to a short code value string. Thus, the embedding region is a region having α or more code values, and a set Γ(n) of embedding regions for the coding order n is defined as follows.

Γ(n)={D_i | |D_i|>αp×n}, D_i={b_ij | j∈[1,|D_i|]} (6)

Here, D_i indicates the i-th embedding region, b_ij indicates the j-th four-letter base in the D_i region, and |D_i| indicates the number of bases in D_i. α indicates the minimum number of code values in the embedding region, and p indicates a prediction order, which will be described in section 3. According to an embodiment of the present invention, the minimum number of code values is set to 10 or more, and the embedding region is selected based on the prediction order p.

A ratio of the number of embedding regions to the total number of non-coding regions on the given DNA sequence is designated by R_region(n), and a ratio of the number of bases in embedding regions to the number of bases in total non-coding regions is designated by R_base(n). FIG. 3A shows the ratio R_region(n) of the number of embedding regions and the ratio R_base(n) of the number of bases when the coding order n ranges from 2 to 10 on the DNA sequence. FIG. 3B shows the code value level and the number of code values with respect to the coding order n, when the number of bases is 100. Referring to these figures, R_region(n) decreases as n increases, but R_base(n) is maintained at 92% or more. In the case where the number of bases is given, when n increases, the number of code values geometrically decreases, but the code value level increases. That is, when the code value level is high, the range of watermarking signal processing is wide and the number of bases is maintained, but the number of target code values is small, and thus watermark capacity is limited. In the present invention, since multiple bits per code value are embedded, when the code value level increases, the number of embedded bits per code value increases, but the number of code values decreases. Thus, for the given non-coding region, the coding order n that is optimal for watermark capacity must be determined.

3. Code Value Prediction-Error Expansion (PE)-Based Reversible Watermarking

When a code value of the non-coding region is given, a prediction-error expansion method used for conventional image data may be used to embed a bit in a pair of code values. For example, when a prediction value x̂ with respect to an arbitrary code value x and a watermark bit w are given, the embedded code value x′ is as follows.

x′=x̂+2(x−x̂)+w=2x−x̂+w (7)

Watermark extraction and code value restoration are easily obtained from x̂ and x′ as

$w = x^{'} - \hat{x} - 2 \left\lfloor \frac{x^{'} - \hat{x}}{2} \right\rfloor, \quad x = \frac{1}{2} (x^{'} + \hat{x} - w) .$

This method is suitable for image data with high correlation between adjacent pixels. By a prediction error modeled as Laplacian distribution, one bit can be embedded into each of pixel pairs.
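The single-bit case of Formula (7) and its inverse can be checked with a small Python sketch. The function names are hypothetical, the prediction value is assumed to be an integer, and no range check on x′ is performed here.

```python
def pee_embed(x, x_hat, w):
    """Formula (7): expand the prediction error by two and append one bit w."""
    return 2 * x - x_hat + w


def pee_extract(x_marked, x_hat):
    """Recover the bit w and the original code value x from x' and the prediction."""
    e = x_marked - x_hat
    w = e - 2 * (e // 2)           # least significant bit of the expanded error
    x = (x_marked + x_hat - w) // 2
    return w, x
```

For x=10, x̂=8, and w=1, the embedded value is 13, and extraction from (13, 8) returns the bit 1 and the original value 10. Python's floor division keeps the extraction correct when the prediction error is negative.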

However, code values of the DNA sequence have a low correlation between successive predictors, and thus an adaptive prediction is required. Also, code values can be moved without limitation under false start codon limitation conditions, and thus multiple bits can be embedded in a pair of code values. Thus, in this section, a code value prediction-error expansion-based multi-bit embedding method will be described.

3-1. Code Value Error Expansion Condition for Multi-Bit Embedding

Except for false start codon values, DNA code values having no condition for definition move without limitation within a valid range. Thus, the prediction error d for a pair of code values can be expanded 2^k times according to an expansion condition to embed k bits, and at most 2n−1 bits can be embedded; k_max=2n−1.

When k bits of watermark {w_j}₁^k and a prediction value x̂ are given, a k-bit embedded code value x′ is obtained by the 2^k times expanded prediction error d as follows.

$\begin{matrix} x^{'} = \hat{x} + 2^{k} d + sgn (d) \sum_{j = 1}^{k} 2^{j - 1} w_{j}, \text{ where } d = x - \hat{x} & (8) \end{matrix}$

When the embedded code value x′ and the number k of bits are given, watermark extraction and restoration are easily performed as follows.

w_j=((x′−x̂)>>(j−1))%2 for j=1, . . . , k (9)

x=x̂+d=x̂+((x′−x̂)>>k) (10)

Since the embedded code value x′ must satisfy 0≤x′≤2²ⁿ−1, the expansion condition of the prediction error d for 2^k times expansion is as follows.

$\begin{matrix} 2^{- k} (- \hat{x} - sgn (d) \sum_{j = 1}^{k} 2^{j - 1} w_{j}) \le d \le 2^{- k} (2^{2 n} - 1 - \hat{x} - sgn (d) \sum_{j = 1}^{k} 2^{j - 1} w_{j}) & (11) \end{matrix}$

The code value x must therefore satisfy the following condition.

$\begin{matrix} x \in [\max (0, \lceil \hat{x} + 2^{- k} (- \hat{x} - α (k)) \rceil), \min (2^{2 n} - 1, \lfloor \hat{x} + 2^{- k} (2^{2 n} - 1 - \hat{x} - α (k)) \rfloor)] & (12) \end{matrix}$

where

$α (k) = sgn (d) \sum_{j = 1}^{k} 2^{j - 1} w_{j} .$

This expansion condition is determined by the k watermark bits {w_j}₁^k and the prediction value x̂, and the number of bits to be embedded in the code value x is determined depending on the expansion condition.
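Multi-bit embedding, extraction, and the range condition 0≤x′≤2²ⁿ−1 can be sketched in Python as follows. All function names are hypothetical and the prediction value is assumed to be an integer. Two further assumptions are made so that the round trip works for negative prediction errors: the shifts of Formulas (9) and (10) are applied to the magnitude |x′−x̂| with the sign restored afterwards, and a zero prediction error is treated as not embeddable, because sgn(0)=0 would discard the bits.

```python
def sgn(v):
    return (v > 0) - (v < 0)


def embed_bits(x, x_hat, bits):
    """Formula (8): bits[0] is w_1, bits[1] is w_2, and so on."""
    d = x - x_hat
    alpha = sgn(d) * sum(w << j for j, w in enumerate(bits))
    return x_hat + (d << len(bits)) + alpha


def extract_bits(x_marked, x_hat, k):
    """Inverse of embed_bits, working on the magnitude of the expanded error."""
    e = x_marked - x_hat
    mag = abs(e)
    bits = [(mag >> j) % 2 for j in range(k)]
    return bits, x_hat + sgn(e) * (mag >> k)


def max_embeddable(x, x_hat, bits, n):
    """Largest k <= 2n-1 for which the embedded value stays in [0, 2^(2n) - 1]."""
    if x == x_hat:
        return 0
    for k in range(min(len(bits), 2 * n - 1), 0, -1):
        if 0 <= embed_bits(x, x_hat, bits[:k]) <= 4 ** n - 1:
            return k
    return 0
```

With n=4 and all watermark bits equal to one, a code value x=1 with x̂=0 can carry the maximum of 7 bits, whereas x=130 with x̂=128 can carry only 5.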

FIG. 4A shows the number of bits to be embedded in the code value x for each prediction value x̂ when the coding order is n=4 (x, x̂∈[0, 2⁸−1]) and all watermark bits are 1, w={1}. The maximum number k_max of embedded bits is 2n−1=7. FIG. 4B shows a range of code values x depending on the number of embedded bits when the prediction value x̂ is 0, 128, and 255. When the number of embedded bits is large, the expandable region narrows geometrically, and when x̂ is close to 0 or 255, the number of embedded bits is small.

3.2 Code Value Prediction

FIGS. 5A and 5B show code values and code value histograms of the ‘AE017199’ and ‘CP000473.1’ sequences, when the coding orders n are 3 and 4. The code value histogram is expanded or reduced depending on the coding order, but its distribution differs from sequence to sequence. That is, code values of the ‘AE017199’ sequence are evenly distributed except in four regions, and code values of the ‘CP000473.1’ sequence are evenly distributed, like white noise, over the whole range. Also, the code value sequence appears random, and correlation between successive code values is extremely low. Thus, in the present invention, in order to reduce the prediction error for the code value, the code value is predicted based on a local LS predictor, as in Dragoi, et al.

A row vector of p code values for predicting the current code value x_i is $\mathbf{x}_i$=(x_{i−1}, . . . , x_{i−p}), and a row vector of p parameters is b=(β₁, . . . , β_p). Here, p indicates a prediction order. When $\mathbf{x}_i$ is observed, the prediction value x̂_i of x_i is defined by a linear regression function f_β($\mathbf{x}_i$) as follows.

$\begin{matrix} {\hat{x}}_{i} = f_{β} (\mathbf{x}_{i}) = \sum_{j = 1}^{p} β_{j} x_{i - j} = \mathbf{x}_{i} b^{'} & (13) \end{matrix}$

When a row vector of all code values in an arbitrary embedding region is y=(x₁, . . . , x_N) and the N×p matrix of N observed rows of previous code values is X=($\mathbf{x}'_1$, . . . , $\mathbf{x}'_N$), the LS predictor computes the parameter b that minimizes the squared distance ∥y′−Xb′∥²=(y′−Xb′)′(y′−Xb′) between y′ and Xb′ as follows.

b′=(X′X)⁻¹X′y′ (14)

In the present invention, rather than a single global prediction over all embedding regions, local prediction is performed for each embedding region to predict the code value. Thus, in the decoding process, additional information consisting of the parameter b for each of the |Γ(n)| embedding regions of the DNA sequence is required.
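A per-region LS fit in the sense of Formula (14) can be sketched in pure Python by forming the normal equations (X′X)b′=X′y′ and solving them by Gaussian elimination. The function name `fit_ls` is hypothetical. The sketch assumes that the region holds more than p code values, that only the values with p predecessors contribute rows to X, and that X′X is non-singular; none of these details is specified in the text.

```python
def fit_ls(values, p):
    """Least-squares parameters (beta_1..beta_p) for x_i ~ sum_j beta_j * x_(i-j)."""
    rows = [[values[i - j] for j in range(1, p + 1)] for i in range(p, len(values))]
    targets = [values[i] for i in range(p, len(values))]

    # Augmented normal equations [X'X | X'y].
    m = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
         + [sum(r[a] * y for r, y in zip(rows, targets))]
         for a in range(p)]

    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, p):
            factor = m[r][col] / m[col][col]
            for c in range(col, p + 1):
                m[r][c] -= factor * m[col][c]

    # Back substitution.
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        rest = sum(m[r][c] * beta[c] for c in range(r + 1, p))
        beta[r] = (m[r][p] - rest) / m[r][r]
    return beta
```

As a sanity check, a sequence that follows x_i=x_{i−1}+x_{i−2} exactly, such as 1, 1, 2, 3, 5, 8, 13, 21, yields parameters close to (1, 1) for p=2.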

The code value may be predicted using a successive predictor x̂_i=x_{i−1} or a mean predictor

${\hat{x}}_{i} = \sum_{j = 1}^{p} x_{i - j} / p .$

FIGS. 6A and 6B show prediction error histograms for successive predictors, mean predictors, and LS predictors when the coding orders are n=3 and n=4 for ‘AE017199’ and ‘CP000473.1’ sequences (p is a prediction order (the number of successive predictors used in prediction), and ER (expandable region) is expansion region occurrence probability).

In FIGS. 6A and 6B, ER indicates expansion region occurrence probability. A successive predictor error has an ER of about 74.8% regardless of the coding order. The mean predictor and the LS predictor have relatively high ER in the case of the coding order n=3, and when the prediction order p is high, ER is high. Particularly, in the case of n=3 and p=20, the LS predictor has the highest ER of 91.6%. That is, in the case of n=3, when the prediction order p of LS is high, insertion capacity is large.

The prediction error histogram of an image is modeled as Laplacian distribution, but the LS prediction error histogram of the code value is modeled as normal distribution with (μ,σ)=(0,20) for n=3 and p=10, (μ,σ)=(0,19) for n=3 and p=20, (μ,σ)=(0,80) for n=4 and p=10, and (μ,σ)=(0,76) for n=4 and p=20.

3.3 Coding Process

In the coding process of the present invention, when the coding order n and the prediction order p are given, an LS prediction parameter b is obtained for each embedding region. The LS predictor with b is used for the code value x_i with i>p, and the mean predictor is used for the code value with i≤p, thereby obtaining x̂_i.

$\begin{matrix} {\hat{x}}_{i} = \left\{ \begin{matrix} \sum_{j = 1}^{p} β_{j} x_{i - j}, & \text{if } i > p \\ \sum_{j = 1}^{i - 1} \frac{x_{i - j}}{i - 1}, & \text{if } 1 < i \le p \\ 0, & \text{if } i = 1 \end{matrix} \right. & (15) \end{matrix}$

After determining the number k_i (0≤k_i≤2n−1) of embedded bits based on the expansion condition of the prediction error d_i=x_i−x̂_i, k_i bits {w_l}_{l=1}^{k_i} are embedded in the code value x_i as follows.

$\begin{matrix} x_{i}^{'} = {\hat{x}}_{i} + 2^{k_{i}} d_{i} + α (k_{i}), \text{ where } α (k_{i}) = sgn (d_{i}) \sum_{l = 1}^{k_{i}} 2^{l - 1} w_{l} & (16) \end{matrix}$

x′_i∉Z^c and x′_{i−1}(n−1,n)∥x′_i(1,2)∉Z^c

When the embedded code value x′_i is included in the false start codon table Z^c, or a false start codon occurs between the previous code value x′_{i−1} and x′_i, the number k_i of embedded bits is reduced by one, and the above-described process is repeated until the conditions are satisfied or k_i is zero. In this way, multiple bits are embedded in code values of all embedding regions, and then a watermarked region Γ′(n) is obtained. When k_i is 0, it indicates a non-embedding region of the prediction error or a case where the false start codon occurs.
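The bit-reduction loop for one code value can be sketched in Python as below. The function names are hypothetical and the prediction value is assumed to be an integer. When no bit count passes the checks, the sketch returns the original value with k=0; it does not re-check that unmodified value against the previous block, which the text leaves open. A zero prediction error is also treated as k=0, an assumption made because sgn(0)=0 would discard the bits.

```python
def has_atg(x, n):
    """True if the n-base block of code value x contains 'ATG' (intra check)."""
    block = "".join("ATCG"[(x >> (2 * (n - k))) % 4] for k in range(1, n + 1))
    return "ATG" in block


def spans_atg(prev, cur, n):
    """True if 'ATG' spans the boundary of two adjacent blocks (inter check, n >= 2)."""
    return ((prev % 16 == 1 and (cur >> (2 * (n - 1))) % 4 == 3)
            or (prev % 4 == 0 and (cur >> (2 * (n - 2))) % 16 == 7))


def embed_value(x, x_hat, prev_marked, bits, n):
    """Embed as many leading bits as allowed; return (marked value, k)."""
    d = x - x_hat
    sign = (d > 0) - (d < 0)
    k = 0 if d == 0 else min(len(bits), 2 * n - 1)
    while k > 0:
        alpha = sign * sum(w << j for j, w in enumerate(bits[:k]))
        marked = x_hat + (d << k) + alpha
        in_range = 0 <= marked <= 4 ** n - 1
        if (in_range and not has_atg(marked, n)
                and (prev_marked is None or not spans_atg(prev_marked, marked, n))):
            return marked, k
        k -= 1                     # reduce the number of embedded bits by one
    return x, 0
```

For n=3, x=2, x̂=1, and bits (0, 1), two-bit embedding would give 7 (‘ATG’), so the loop falls back to one bit and returns the value 3 with k=1.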

The number K={k_i} of embedded bits for each code value and the prediction parameter b for each embedding region are additional information required in watermark extraction and original sequence restoration. It is required that the additional information is included in the watermarked region Γ′(n) and is transmitted without occurrence of the false start codon and without generation of further additional information. In the present invention, by arithmetic coding, lossless compression is performed on the number K of embedded bits, the prediction parameter b, and the LSB bits E of the 2-bit base binary numbers in Γ′(n), thereby generating a compression bit string C={c_i}. The compression bit c_i is substituted into the LSB of the binary number b′_i of the four-letter base as follows.

b′_i=((b′_i>>1)<<1)+c_i, if b′_{i−2}≠‘A’ and b′_{i−1}≠‘T’ (17)

Here, in a case where the two previous embedded bases (b′_{i−2}, b′_{i−1}) are “AT”, when the current base is b′_i=‘G’, b′_i is substituted by one of ‘A’, ‘T’, and ‘C’. When b′_i≠‘G’, embedding is omitted. Finally, a base string “AT” in the embedding region Γ″(n) including the compression string C serves as a marker directly indicating that the subsequent base does not include a compression bit. The length of the compression string C is determined by the compression algorithm; in the present invention, arithmetic coding, which is a general lossless compression algorithm, is used. Consequently, the DNA sequence D′=D^nc+D^c, D^nc=Γ″(n)+Γ^c(n) containing the additional information and the non-coding region Γ″(n) where the watermark is embedded is transmitted.

3.4 Decoding and Restoration Processes

In the decoding process, first, in the non-coding region Γ″(n) of the transmitted DNA sequence D′, the additional information compression string C is read from the LSBs of all bases except for the base following “AT”, and the number K of embedded bits, the prediction parameter b, and the base LSB bits E are obtained from it. The code sequence X′ of Γ′(n), in which the base LSB bits E are substituted back into Γ″(n), is obtained by the coding order n. From all code values in X′, the watermark is extracted using the number K of embedded bits and the prediction parameter b, and the original code values are restored.

For example, when the number of embedded bits k_i>0 and an arbitrary code value x′_i are given, the prediction value x̂_i is obtained from the previously restored code values (x_{i−1}, . . . , x_{i−p}), and then the k_i watermark bits are extracted from the prediction error d_i=x′_i−x̂_i as w_l=((x′_i−x̂_i)>>(l−1))%2 for l=1, . . . , k_i. The original code value x_i is restored by k_i-bit shifting of the prediction error d_i as x_i=x̂_i+((x′_i−x̂_i)>>k_i).
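The sequential nature of decoding, where each prediction uses only values that have already been restored, can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that the predictor of Formula (15) is rounded to an integer identically on both sides, and that the shifts act on the magnitude of x′_i−x̂_i with the sign restored afterwards so that negative prediction errors round-trip.

```python
def predict(restored, beta):
    """Formula (15), evaluated on the code values restored so far."""
    i = len(restored)              # 0-based index of the value being predicted
    p = len(beta)
    if i == 0:
        return 0
    if i < p:                      # fewer than p predecessors: mean predictor
        return round(sum(restored) / i)
    return round(sum(b * restored[i - j] for j, b in enumerate(beta, start=1)))


def restore_region(marked, ks, beta):
    """Extract the watermark bits and restore the original code values."""
    restored, bits = [], []
    for x_marked, k in zip(marked, ks):
        x_hat = predict(restored, beta)
        e = x_marked - x_hat
        mag = abs(e)
        bits.extend((mag >> j) % 2 for j in range(k))
        sign = (e > 0) - (e < 0)
        restored.append(x_hat + sign * (mag >> k))
    return bits, restored
```

With p=1 and β₁=1, the watermarked values (21, 20, 9) with embedded bit counts (1, 2, 1) decode to the bits 1, 0, 1, 1 and the original code values (10, 12, 11).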

3.5 Watermark Capacity and Additional Information Amount

Watermark capacity is affected by the coding order n and the prediction order p. When n and p are given, the number of watermark bits embedded in the embedding region Γ(n)={D_i}_{i=1}^{|Γ(n)|} is the sum of the number K of embedded bits for each code value in the region. Thus, the number of bits per base (bpn) bpn_PE(n,p) is as follows.

$\begin{matrix} {bpn}_{PE} (n, p) = \frac{1}{|Γ (n)|} \sum_{i = 1}^{|Γ (n)|} \left( \frac{1}{N_{i}} \sum_{j = 1}^{N_{i}} k_{j} \right) [bit / base] & (18) \end{matrix}$

where N_i=⌊|D_i|/n⌋ and 0≤k_j≤2n−1

|Γ(n)| indicates the number of embedding regions, and N_i indicates the number of code values in the region D_i.

The LSB-substitutable bit amount Φ available for embedding the additional information compression string C is determined by the number of bases omitted because of the false start codon in the substituting process. The maximum is equal to the total number

$\sum_{i = 1}^{|Γ (n)|} |D_{i}|$

of bases in Γ′(n). The length of the additional information compression string C must be less than the substitutable bit amount Φ; therefore, either the amount of the additional information, that is, the number K of embedded bits, the prediction parameter b, and the LSB E of the 2-bit base, must be small, or an algorithm with high compression efficiency is required. When an arbitrary watermarked region D′_i (∈Γ′(n)) is given, E consists of |D_i| bits, the number K of embedded bits is expressed by N_i⌈log₂2n⌉ bits, and the prediction parameter b for each embedding region is expressed by p floating-point numbers of 32 bits. Thus, additional information Extra_PE(n,p) for Γ′(n) is as follows.

$\begin{matrix} {Extra}_{PE} (n, p) = \sum_{i = 1}^{|Γ (n)|} (N_{i} \lceil \log_{2} 2 n \rceil + |D_{i}| + 32 p) [bit] & (19) \end{matrix}$
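Formula (19) can be evaluated with a few lines of Python. The function name and the representation of the embedding regions as a list of base counts |D_i| are assumptions made for illustration.

```python
def extra_bits(region_lengths, n, p):
    """Formula (19): uncompressed additional information in bits."""
    # ceil(log2(2n)) computed on integers: bits needed to store k in [0, 2n-1].
    bits_per_k = (2 * n - 1).bit_length()
    total = 0
    for length in region_lengths:
        n_values = length // n     # N_i = floor(|D_i| / n)
        total += n_values * bits_per_k + length + 32 * p
    return total
```

For two regions of 100 and 64 bases with n=4 and p=10, each k_i needs 3 bits, and the total is (75+100+320)+(48+64+320)=927 bits before compression.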

When the length of the additional information compression string C is ρ×Extra_PE(n,p), compression is performed so that

$ρ \times {Extra}_{PE} (n, p) < Φ \le \sum_{i = 1}^{|Γ (n)|} |D_{i}| .$

4. Code Value Histogram Shifting-Based Method

Code values in a non-coding region may be shifted to any value outside the code value table containing the false start codon. In this section, non-circular and circular code value histogram shifting-based methods for increasing data capacity will be described.

4.1 Non-Circular Histogram Shifting (HS)

(1) Coding Process

In the present invention, an n order code value histogram domain Z=[0, 2²ⁿ−1] is divided into M sections {P_i}_{i=1}^{M}. Here, each section is bilaterally symmetric with respect to a center value R_i, and R_i is used as a reference value of shifting. Thus, the length of the section is an odd number, and is determined by the number of embedded bits.

When the maximum number of shifting bits in the section is k_max and the center value is R_i=z, P_i consists of 2×2^{k_max}−1 values as follows.

P_i={z−2^{k_max}+1, . . . , z−1, z, z+1, . . . , z+2^{k_max}−1}, for i∈[1,M] (20)

R_i=z (21)

The number M of sections is as follows.

$\begin{matrix} M = \left\lfloor \frac{2^{2 n}}{2 \times 2^{k_{\max}} - 1} \right\rfloor, \text{ where } 1 \le k_{\max} \le 2 n - 1 & (22) \end{matrix}$

Here, a residual section of 2²ⁿ−(2×2^{k_max}−1)M values is Z^c=Z−⋃_{i=1}^{M}P_i, and is not selected for watermark embedding.
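The partition of Formulas (20) to (22) can be sketched in Python as follows. The function name is hypothetical, and placing the sections contiguously from code value 0, so that the residual values sit at the top of the domain, is an assumption; the text does not state where the sections start.

```python
def histogram_sections(n, k_max):
    """Return (section length, list of center values R_i, residual code values)."""
    length = 2 * 2 ** k_max - 1            # 2 * 2^k_max - 1 values per section
    m = 4 ** n // length                   # Formula (22)
    # Section i (0-based) covers [i*length, (i+1)*length - 1]; its center is the
    # middle value, 2^k_max - 1 above the lower bound.
    centers = [i * length + 2 ** k_max - 1 for i in range(m)]
    residual = list(range(m * length, 4 ** n))
    return length, centers, residual
```

For n=3 and k_max=3, each section has 15 values, M=⌊64/15⌋=4 with centers 7, 22, 37, and 52, and the four values 60 to 63 form the residual section.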

When an arbitrary code value x_i belongs to the section P_i, a difference from the center value R_i of the section is d_i=x_i−R_i, x_i∈P_i. Here, based on the range of |d_i|, the number k_i of bits to be embedded in x_i is determined as follows.

$\begin{matrix} \sum_{l = 0}^{k_{\max} - k_{i} - 1} 2^{l} < |d_{i}| \le \sum_{l = 0}^{k_{\max} - k_{i}} 2^{l},\ k_{i} \ge 1, \text{ if } x_{i} \neq R_{i} & (23) \end{matrix}$

k_i=0, if x_i=R_i

Next, k_i bits {w_l}_{l=1}^{k_i} are embedded in x_i as follows.

$\begin{matrix} x_{i}^{'} = R_{i} + 2^{k_{i}} d_{i} + α (k_{i}), \text{ where } α (k_{i}) = sgn (d_{i}) \sum_{l = 1}^{k_{i}} 2^{l - 1} w_{l} & (24) \end{matrix}$

x′_i∉Z^c and x′_{i−1}(n−1,n)∥x′_i(1,2)∉Z^c

For the value x_i=R_i, which is the center value of the section, the number of embedded bits is k_i=0, and the value is excluded from bit embedding. Here, when a shifted code value x′_i is in the false start codon table Z^c or when the false start codon occurs between x′_i and the previous shifted code value x′_{i−1}, the number k_i of embedded bits is reduced by one, and this process is repeated until the conditions are satisfied or k_i reaches zero. Thus, the false start codon is prevented in the same manner as in the successive code value pair DE method. In this way, for all code values in the embedding target region, multiple bits are embedded depending on the number of embedded bits for each code value, and then the watermarked non-coding region Γ′(n) is obtained.

As additional information for watermark extraction and original sequence restoration, the number K={k_i} of embedded bits for each code value, a marker T={τ} of a section shifted based on a section reference value, and the LSB bits E of the 2-bit base binary numbers in the watermarked non-coding region Γ′(n) are required. As in the successive code value pair DE method, a bit string C of the additional information (K,T,E) is generated with lossless compression, and then the bit string is substituted into the LSBs of the base binary numbers in Γ′(n). The DNA sequence D′=D^nc+D^c, D^nc=Γ″(n)+Γ^c(n) containing the final additional information and the non-coding region Γ″(n) where the watermark is embedded is transmitted.

FIG. 7 shows code value shifting based on the difference |d| from the center value R_i and a watermark bit when the maximum number of shifting bits on P_i is k_max=3. An arbitrary section P_i of the histogram domain is divided into a left subsection P_i⁻ and a right subsection P_i⁺ based on the center value R_i. In the case of |d|=1, 3-bit (k=3) embedding is possible. In the case of |d|∈{2,3}, 2-bit (k=2) embedding is possible, and in the case of |d|∈{4,5,6,7}, 1-bit (k=1) embedding is possible. In the case of |d|=0, that is, x=R_i, no bit is embedded (k=0).
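
The assignment of FIG. 7 can be sketched in Python as follows. This is a minimal illustration rather than the claimed implementation: the function name `bits_for_difference` is hypothetical, and the closed form k = k_max − ⌊log₂|d|⌋ is inferred from the ranges listed above.

```python
def bits_for_difference(d: int, k_max: int) -> int:
    """Number of embeddable bits k for a difference d from the section center.

    Assumed rule (read off the FIG. 7 ranges): k = k_max - floor(log2(|d|)),
    with k = 0 when d == 0 or when |d| lies outside the section
    (|d| >= 2**k_max).
    """
    magnitude = abs(d)
    if magnitude == 0 or magnitude >= 2 ** k_max:
        return 0
    # For magnitude >= 1, bit_length() - 1 equals floor(log2(magnitude)).
    return k_max - (magnitude.bit_length() - 1)
```

With k_max=3, the differences |d|=0, 1, ..., 7 map to k=0, 3, 2, 2, 1, 1, 1, 1, matching the cases described for FIG. 7.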

The code value x in the right subsection P_i⁺ (d>0) of the section P_i is shifted by the watermark bits to the left subsection P_{i+1}⁻ (d≤0) of the right section P_{i+1}. In contrast, x in the left subsection P_i⁻ (d<0) of the section P_i is shifted by the watermark bits to the right subsection P_{i−1}⁺ (d≥0) of the left section P_{i−1}. In other words, as shown in FIG. 8A, the code values of the right subsection of P_i and the code values of the left subsection of the right adjacent section P_{i+1} are shifted into each other's positions. Likewise, the code values of the left subsection of P_i and the code values of the right subsection of the left adjacent section P_{i−1} are shifted into each other's positions.
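
The shift into the adjacent subsection can be illustrated with a short Python sketch of Equation (24). The function name `embed_bits`, the bit ordering (`bits[0]` taken as w_1), and the example centers 7 and 22 (adjacent sections of 15 values for k_max=3) are assumptions made for illustration only; the false start codon check is omitted.

```python
def embed_bits(x: int, center: int, bits: list[int]) -> int:
    """Shift code value x away from its section center while embedding bits.

    Implements x' = R + 2**k * d + sgn(d) * sum(2**(l-1) * w_l) with
    k = len(bits). bits[0] is treated as w_1 (assumed ordering).
    """
    d = x - center
    k = len(bits)
    if d == 0 or k == 0:
        return x  # center values and k = 0 values are left unchanged
    sign = 1 if d > 0 else -1
    alpha = sign * sum(w << l for l, w in enumerate(bits))
    return center + (d << k) + alpha


# x = 8 lies in the right subsection of the section centered at 7 (d = 1, k = 3).
shifted_right = embed_bits(8, 7, [1, 0, 1])
# x = 21 lies in the left subsection of the section centered at 22 (d = -1, k = 3).
shifted_left = embed_bits(21, 22, [1, 0, 1])
```

Under these assumptions, `shifted_right` is 20, which lies in the left subsection {15, ..., 22} of the next section, and `shifted_left` is 9, which lies in the right subsection {8, ..., 14} of the previous section.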

Among the watermarked code values, a code value equal to a center value, x′_i=R_i, is generated in three cases. First, when the original code value is the center value x_i=R_i (k_i=0), it is excluded from shifting, and thus the original code value x_i=R_i is not shifted. The other two cases, as shown in FIG. 8A, are that a value in the right subsection P_{i−1}⁺ of the left section or a value in the left subsection P_{i+1}⁻ of the right section is shifted to R_i. The case where shifting is performed and the case where it is not can be distinguished by the number of embedded bits for each code value, but the two shifted cases cannot. Thus, for extraction and restoration, the information T={τ} on the section before shifting is required as follows.

$\begin{matrix} \tau = \begin{cases} 0, & \text{if } x^{'} = R_{i} \text{ and } x \in P_{i - 1}^{+} \\ 1, & \text{if } x^{'} = R_{i} \text{ and } x \in P_{i + 1}^{-} \end{cases} & (25) \end{matrix}$

As shown in FIG. 8B, among the M sections, code values from the right subsection P₁⁺ of P₁ to the left subsection P_M⁻ of P_M are shifted. Code values in the remaining boundary subsections P₁⁻ and P_M⁺ are assigned the number of shifting bits k=0.

(2) Decoding and Restoration Processes

In the decoding process of the present invention, the compressed bit string of the additional information (K,T,E) is obtained from the non-coding region Γ″(n) of the transmitted DNA sequence D′, and the watermarked non-coding region Γ′(n) is then obtained by substituting E back into the base binary numbers. From the code sequence X′ of Γ′(n), watermark extraction and original value restoration are performed using the number K of shifting bits for each code value and the marker T={τ} of the shifted section.

When the code value x′_i of the code sequence X′ is given, the center value R of the original section of x′_i must be obtained first. That is, when the shifted section P_j of x′_i is not a boundary section (x′_i∈P_j) and the number k_i of shifting bits satisfies k_i>0, the center value R of the previous section of x′_i is obtained as follows.

$\begin{matrix} R = \begin{cases} R_{j - 1}, & \text{if } x_{i}^{'} \in P_{j}^{-} \text{ or } (x_{i}^{'} = R_{j} \text{ and } \tau_{i} = 0) \\ R_{j + 1}, & \text{if } x_{i}^{'} \in P_{j}^{+} \text{ or } (x_{i}^{'} = R_{j} \text{ and } \tau_{i} = 1) \end{cases}, \quad \text{if } x_{i}^{'} \in P_{j} \text{ and } k_{i} > 0 & (26) \end{matrix}$

Here, based on the shifted section P_j of x′_i, the center value R of the section before embedding is easily obtained. However, when x′_i is the center value R_j of the shifted section P_j (x′_i=R_j), R is obtained from the marker τ_i of the previous section. The k_i watermark bits {w_l}_{l=1}^{k_i} embedded in x′_i and the original code value x_i are obtained using the center value R of the previous section as follows.

w_l=((x′_i−R)>>(l−1))%2 for l=1, . . . ,k_i (27)

x_i=R+((x′_i−R)>>k_i) (28)
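
The two decoding steps, locating the previous center per Equation (26) and then applying Equations (27) and (28), can be sketched as below. The names `original_center` and `extract_bits` are hypothetical, the section layout (each section spanning its center ± (2^{k_max}−1)) is assumed, and the shift is applied to the magnitude |x′_i−R| with the sign restored afterwards, which is an assumption made here so that values shifted to the left decode symmetrically.

```python
def original_center(x_shifted: int, tau: int, centers: list[int], k_max: int) -> int:
    """Center R of the section a shifted value (k > 0) came from.

    Assumes section j spans centers[j] +/- (2**k_max - 1) and that the value
    did not originate beyond the first or last section.
    """
    half = 2 ** k_max - 1
    for j, r in enumerate(centers):
        if r - half <= x_shifted <= r + half:
            from_left = x_shifted < r or (x_shifted == r and tau == 0)
            step = -1 if from_left else 1
            if not 0 <= j + step < len(centers):
                raise ValueError("boundary subsection: no neighbouring section")
            return centers[j + step]
    raise ValueError("value is outside every section")


def extract_bits(x_shifted: int, center: int, k: int) -> tuple[list[int], int]:
    """Recover the k watermark bits (w_1 first) and the original code value."""
    diff = x_shifted - center
    sign = 1 if diff >= 0 else -1
    magnitude = abs(diff)
    bits = [(magnitude >> l) & 1 for l in range(k)]
    return bits, center + sign * (magnitude >> k)


centers = [7, 22, 37]                                # assumed layout for k_max = 3
r_prev = original_center(20, 0, centers, k_max=3)    # 20 lies left of 22
bits, x_original = extract_bits(20, r_prev, k=3)
```

For the shifted value 20, the sketch returns the previous center 7, the bits [1, 0, 1], and the original code value 8.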

(3) Watermark Capacity and Additional Information

When the coding order n and the maximum number k_max of section shifting bits are given, the number of watermark bits embedded in the embedding region $\Gamma (n) = \{D_{i}\}_{i = 1}^{|\Gamma (n)|}$ is determined based on the number of bits defined by the difference range from the center value in the histogram domain section P_i and the frequency at which the code value belongs to each section.

The frequency of the value z in the code value histogram is denoted by p(z). Here, the number of shifting bits on an arbitrary section P_j is calculated as the sum of the number C(P_j⁻) of shifting bits in the left subsection P_j⁻ and the number C(P_j⁺) of shifting bits in the right subsection P_j⁺.

$\begin{matrix} C (P_{j}^{+}) = \sum_{i = 0}^{k_{\max} - 1} (\sum_{t = 0}^{2^{i} - 1} p (R_{j} + 2^{i} + t) (k_{\max} - i)), for d > 0 & (29) \\ C (P_{j}^{-}) = \sum_{i = 0}^{k_{\max} - 1} (\sum_{t = 0}^{2^{i} - 1} p (R_{j} - 2^{i} - t) (k_{\max} - i)), for d < 0 & (30) \end{matrix}$
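
Equations (29) and (30) can be evaluated directly from a histogram, as in the following sketch. The function name `subsection_capacity`, the dictionary representation of p(z), and the use of raw counts instead of normalized frequencies are assumptions made for illustration.

```python
def subsection_capacity(hist: dict[int, int], center: int, k_max: int, side: int) -> int:
    """Bits carried by one subsection: side=+1 for P_j^+, side=-1 for P_j^-.

    hist maps a code value z to its count p(z). Values with
    |d| in [2**i, 2**(i+1) - 1] each carry k_max - i bits.
    """
    total = 0
    for i in range(k_max):
        for t in range(2 ** i):
            z = center + side * (2 ** i + t)
            total += hist.get(z, 0) * (k_max - i)
    return total


hist = {6: 3, 7: 10, 8: 2, 9: 1, 11: 4}   # hypothetical counts around center 7
c_plus = subsection_capacity(hist, center=7, k_max=3, side=+1)
c_minus = subsection_capacity(hist, center=7, k_max=3, side=-1)
```

In this hypothetical histogram the right subsection carries 2×3 + 1×2 + 4×1 = 12 bits and the left subsection carries 3×3 = 9 bits; the ten values at the center contribute nothing.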

The total number of watermark bits embedded in Γ(n)={D_i}_{i=1}^{|Γ(n)|} is the sum of the numbers of shifting bits on all subsections except for the boundary subsections P₁⁻ and P_M⁺ among the total M sections, and the number of bits per base (bpn), bpn_HS(n,k_max), is defined as follows.

$\begin{matrix} {bpn}_{HS} (n, k_{\max}) = \frac{1}{\sum_{i = 1}^{|\Gamma (n)|} N_{i}} \left( C (P_{1}^{+}) + \sum_{j = 2}^{M - 1} \left( C (P_{j}^{+}) + C (P_{j}^{-}) \right) + C (P_{M}^{-}) \right) \text{ [bit/base]} & (31) \end{matrix}$

Here, |Γ(n)| is the number of embedding regions, N_i is the number of code values in the region D_i, and $\sum_{i = 1}^{|\Gamma (n)|} N_{i}$ is the total number of bases in the embedding target region.

The additional information Extra_HS(n,k_max) for watermark extraction and restoration consists of the number K of shifting bits for each code value, the marker T of the section shifted based on the section reference value, and the LSB bits E of the 2-bit base binary numbers of the watermarked non-coding region Γ′(n). When the maximum number of shifting bits in the histogram domain section is k_max, the number of embedded bits of one code value is expressed by ⌈log₂k_max⌉ bits. Thus, the number K of shifting bits for all code values is expressed by a total of $\lceil \log_{2} k_{\max} \rceil \sum_{i = 1}^{|\Gamma (n)|} N_{i}$ bits. The marker T of the shifted section is binary information indicating whether a code value x′=R_j shifted onto the center value of an adjacent section came from the left section or the right section, and is expressed by $T = \sum_{i = 1}^{|\Gamma (n)|} N_{i} \times \sum_{j = 1}^{M} p (x^{'} = R_{j})$ bits. E is $\sum_{i = 1}^{|\Gamma (n)|} |D_{i}|$ bits, which is the same as the number of bases of all regions in Γ′(n). Thus, the additional information Extra_HS(n,k_max) is as follows.

$\begin{matrix} \begin{matrix} {Extra}_{HS} (n, k_{\max}) = K + T + E \\ = \lceil \log_{2} k_{\max} \rceil \sum_{i = 1}^{|\Gamma (n)|} N_{i} + \sum_{i = 1}^{|\Gamma (n)|} N_{i} \times \sum_{j = 1}^{M} p (x^{'} = R_{j}) + \sum_{i = 1}^{|\Gamma (n)|} |D_{i}| \\ = \sum_{i = 1}^{|\Gamma (n)|} \left( N_{i} \left( \lceil \log_{2} k_{\max} \rceil + \sum_{j = 1}^{M} p (x^{'} = R_{j}) \right) + |D_{i}| \right) \text{ [bit]} \end{matrix} & (32) \end{matrix}$
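
As a rough numerical illustration of Equation (32), the uncompressed size of the additional information can be tallied as below. The function name and parameters are hypothetical; in particular, T is passed as a count of shifted code values that coincide with a section center, which is assumed to equal the product of the total number of code values and the center probability in the equation.

```python
import math


def extra_bits_hs(n_codes: int, n_bases: int, k_max: int, n_center_hits: int) -> int:
    """Uncompressed side-information size K + T + E in bits.

    n_codes       : total number of code values over all embedding regions
    n_bases       : total number of bases, i.e. the length of E
    n_center_hits : number of shifted code values equal to a section center
    """
    k_bits = math.ceil(math.log2(k_max)) * n_codes   # K: per-code bit counts
    return k_bits + n_center_hits + n_bases          # + T + E


total_bits = extra_bits_hs(n_codes=1000, n_bases=2000, k_max=3, n_center_hits=250)
```

With these made-up figures, K takes 2×1000 bits, T takes 250 bits, and E takes 2000 bits, for 4250 bits before compression.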

When the compression rate is ρ, lossless compression is performed such that the additional information Extra_HS(n,k_max) satisfies

$\rho \times {Extra}_{HS} (n, k_{\max}) < \Phi \le \sum_{i = 1}^{|\Gamma (n)|} |D_{i}| .$

The watermark bit is not embedded (k=0) for code values in the boundary subsections of the histogram domain, in the residual section that does not belong to any section, and at the center value of a section. That is, the probability P(k=0|x) of k=0 is as follows.

$P (k = 0 | x) = \sum_{t = 0}^{R_{1} - 1} p (x = t) + \sum_{t = R_{M} + 1}^{R_{M} + 2^{k_{\max}} - 1} p (x = t) + \sum_{t = R_{M} + 2^{k_{\max}}}^{2^{2 n} - 1} p (x = t) + \sum_{j = 1}^{M} p (x = R_{j})$

Here, $\sum_{t = 0}^{R_{1} - 1} p (t)$ is the probability of a code value in the P₁⁻ subsection, $\sum_{t = R_{M} + 1}^{R_{M} + 2^{k_{\max}} - 1} p (t)$ is the probability of a code value in the P_M⁺ subsection, and $\sum_{t = R_{M} + 2^{k_{\max}}}^{2^{2 n} - 1} p (t)$ is the probability of a value in the residual section that does not belong to any section P. Last, $\sum_{j = 1}^{M} p (R_{j})$ is the probability of the code values that are the center values of all sections.

Together with P(k=0|x), the probabilities $P (k = 1 | x), P (k = 2 | x), \dots, P (k = k_{\max} | x)$ satisfy

$\sum_{i = 0}^{k_{\max}} P (k = i | x) = 1$

4.2 Circular Histogram Shifting (CHS)

Unlike the pixel values of an image, code values in the non-coding region have no constraint on their range of definition, and thus shifting between the maximum value and the minimum value is possible. In the circular histogram shifting method, histogram section shifting is changed to circular histogram shifting such that embedding is possible in the left subsection P₁⁻ (d<0) of P₁ and in the right subsection P_M⁺ (d>0) of P_M, which are the boundary subsections, thereby increasing the watermark capacity relative to the non-circular histogram shifting method.

(1) Coding Process

In the remaining sections except for the boundary sections and the residual section, the watermark is embedded in the same manner as in the embedding process of the non-circular histogram shifting method. In the circular form of the histogram domain, as shown in FIG. 9, the two boundary subsections P₁⁻ and P_M⁺ cannot be shifted to each other because the residual section lies between them. Thus, in the present invention, P_M⁺ is moved past the residual section to the end of the code value range such that the two subsections of P_M are separated. That is, when the number of code values in the residual section is δ=2²ⁿ−(2×2^{k_max}−1)M, the section P_M is

P_M=P_M⁻+P_M⁺ (33)

where P_M⁻={z−2^{k_max}+1, . . . , z−1, z}, R_M⁻=z,

P_M⁺={z+δ, z+δ+1, . . . , z+δ+2^{k_max}−1 (=2²ⁿ−1)}, R_M⁺=z+δ,

that is, P_M is divided into a subsection P_M⁻ at or below R_M⁻=z and a subsection P_M⁺ at or above R_M⁺=z+δ. Thus, two center reference values are generated in the section P_M.
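
The split of P_M in Equation (33) can be checked numerically with the sketch below. The function name `circular_last_section` and the returned dictionary keys are hypothetical; the code assumes 2n-bit code values in the range 0 to 2²ⁿ−1 and derives z from the stated identity z+δ+2^{k_max}−1=2²ⁿ−1.

```python
def circular_last_section(n: int, k_max: int, m: int) -> dict[str, object]:
    """Split the last section P_M around the residual values.

    Assumes M sections of 2 * 2**k_max - 1 values each, as in the expression
    for delta, laid out from code value 0 upward.
    """
    size = 2 ** (2 * n)
    half = 2 ** k_max
    delta = size - (2 * half - 1) * m
    if delta < 0:
        raise ValueError("M sections do not fit in the code value range")
    z = size - half - delta   # from z + delta + 2**k_max - 1 = 2**(2n) - 1
    return {
        "delta": delta,
        "R_M_minus": z,
        "P_M_minus": list(range(z - half + 1, z + 1)),
        "R_M_plus": z + delta,
        "P_M_plus": list(range(z + delta, size)),
    }


layout = circular_last_section(n=2, k_max=2, m=2)
```

For n=2, k_max=2, and M=2 (16 code values), this gives δ=2, P_M⁻={7, 8, 9, 10} with R_M⁻=10, and P_M⁺={12, 13, 14, 15} with R_M⁺=12, leaving the value 11 between the two centers.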

With the center value R of the section to which an arbitrary code value x_i belongs given by

$\begin{matrix} R = \begin{cases} R_{j}, & \text{if } x_{i} \in P_{j} \text{ for } j = 1, 2, \dots, M - 1 \\ R_{M}^{-}, & \text{if } x_{i} \in P_{M}^{-} \text{ for } j = M \\ R_{M}^{+}, & \text{if } x_{i} \in P_{M}^{+} \text{ for } j = M \end{cases}, & (34) \end{matrix}$

the k_i bits {w_l}_{l=1}^{k_i} are embedded as follows.

x′_i=(R+2^{k_i}d_i+α(k_i))%2²ⁿ (35)

where d_i=x_i−R and

$\alpha (k_{i}) = \operatorname{sgn} (d_{i}) \sum_{l = 1}^{k_{i}} 2^{l - 1} w_{l}$

Here, the number of shifting bits is zero for the residual values [R_M⁻+1, R_M⁺−1] between P_M⁻ and P_M⁺ and for the code values that are the center values of the respective sections.
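
The wrap-around effect of the modulo operation in the circular embedding can be seen in a small sketch. The function name `embed_bits_circular` and the bit ordering (`bits[0]` taken as w_1) are assumptions, and the example values follow an assumed layout with n=2 and k_max=2 in which R₁=3 and R_M⁺=12.

```python
def embed_bits_circular(x: int, center: int, bits: list[int], n: int) -> int:
    """Circular variant: x' = (R + 2**k * d + alpha(k)) mod 2**(2n)."""
    d = x - center
    k = len(bits)
    if d == 0 or k == 0:
        return x  # centers and residual values carry no bits
    sign = 1 if d > 0 else -1
    alpha = sign * sum(w << l for l, w in enumerate(bits))
    # Python's % returns a non-negative result for a positive modulus,
    # so values shifted below 0 wrap to the top of the range.
    return (center + (d << k) + alpha) % (2 ** (2 * n))


wrapped_down = embed_bits_circular(2, 3, [1, 1], n=2)    # from P_1^- (d = -1)
wrapped_up = embed_bits_circular(13, 12, [0, 1], n=2)    # from P_M^+ (d = +1)
```

Here the value 2 in P₁⁻ wraps to 12, which is the center R_M⁺ of the upper boundary subsection, and the value 13 in P_M⁺ wraps to 2 in P₁⁻.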

The information T={τ} on the previous section for a value x′_i shifted to the center value of an adjacent section is determined as follows.

$\begin{matrix} \tau = \begin{cases} 0, & \text{if } (x^{'} = R_{j} \text{ and } x \in P_{j - 1}) \text{ or } (x^{'} = R_{1} \text{ and } x \in P_{M}^{+}) \\ 1, & \text{if } (x^{'} = R_{j} \text{ and } x \in P_{j + 1}) \text{ or } (x^{'} = R_{M}^{+} \text{ and } x \in P_{1}) \end{cases} & (36) \end{matrix}$

In this way, watermarks are embedded into all code values in the code sequence X without the occurrence of intra-code or inter-code false start codons, and the watermarked non-coding region Γ′(n) is obtained. As in the non-circular method, the additional information required for watermark decoding and restoration of the original code values is the number K of shifting bits for each code value, the marker T of the shifted section, and the LSB bits E of the 2-bit base binary numbers. LSB substitution of the compressed additional information is applied in the same manner as in the preceding methods, and the final watermarked DNA sequence D′ containing the substituted region Γ″(n) is transmitted.

(2) Decoding and Restoration Processes

From the substituted region Γ″(n) of the transmitted DNA sequence, the watermarked region Γ′(n) is obtained by inverse substitution, and then, from the code sequence X′ in Γ′(n), the watermark is decoded using (K,T) and the original code sequence is restored.

When a code value x′_i with k_i>0 is given in the code sequence X′, the center value R of the previous section of x′_i is obtained as follows, depending on whether x′_i lies in a non-boundary region or a boundary region.

$\begin{matrix} R = \begin{cases} R_{j - 1}, & \text{if } x_{i}^{'} \in P_{j}^{-} \text{ or } (x_{i}^{'} = R_{j} \text{ and } \tau_{i} = 0) \\ R_{j + 1}, & \text{if } x_{i}^{'} \in P_{j}^{+} \text{ or } (x_{i}^{'} = R_{j} \text{ and } \tau_{i} = 1) \end{cases} \quad \text{for the non-boundary region} & (37) \\ R = \begin{cases} R_{M}^{+}, & \text{if } 0 \le x_{i}^{'} < R_{1} \text{ or } (x_{i}^{'} = R_{1} \text{ and } \tau_{i} = 0) \\ R_{1}, & \text{if } R_{M}^{+} < x_{i}^{'} \le 2^{2 n} - 1 \text{ or } (x_{i}^{'} = R_{M}^{+} \text{ and } \tau_{i} = 1) \end{cases} \quad \text{for the boundary region} & (38) \end{matrix}$

The k_i bits {w_l}_{l=1}^{k_i} and the original code value x_i are obtained using R as follows.

w_l=(((x′_i−R)%2²ⁿ)>>(l−1))%2 for l=1, . . . ,k_i (39)

x_i=R+(((x′_i−R)%2²ⁿ)>>k_i) (40)
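
A minimal sketch of the circular decoding step is given below. The function name `extract_bits_circular` is hypothetical, and the explicit `direction` argument (+1 when the value was shifted upward from R, −1 when shifted downward) is an assumption of this sketch; it is introduced so that the modular difference is always taken in the direction of the original shift. The example values continue the assumed layout with n=2, k_max=2, R₁=3, and R_M⁺=12.

```python
def extract_bits_circular(x_shifted: int, center: int, k: int, n: int,
                          direction: int) -> tuple[list[int], int]:
    """Undo a circular shift and return (bits with w_1 first, original value).

    direction: +1 if the original difference d was positive, -1 if negative.
    """
    modulus = 2 ** (2 * n)
    # Distance travelled from the previous center, measured along the shift.
    magnitude = (direction * (x_shifted - center)) % modulus
    bits = [(magnitude >> l) & 1 for l in range(k)]
    original = (center + direction * (magnitude >> k)) % modulus
    return bits, original


bits_up, x_up = extract_bits_circular(2, 12, k=2, n=2, direction=+1)
bits_down, x_down = extract_bits_circular(12, 3, k=2, n=2, direction=-1)
```

The first call recovers the bits [0, 1] and the original value 13 in P_M⁺; the second recovers the bits [1, 1] and the original value 2 in P₁⁻.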

(3) Watermark Capacity and Additional Information

In the circular histogram shifting method, the watermark is embedded in all sections except for the residual section in the code value histogram domain. Thus, when the coding order n and the maximum number k_max of section shifting bits are given, the number of watermark bits in the embedding region Γ(n) is the sum of the numbers of shifting bits on the left subsection P_j⁻ (d<0) and the right subsection P_j⁺ (d>0) of each section, and its number of bits per base, bpn_CHS(n,k_max), is as follows.

$\begin{matrix} {bpn}_{CHS} (n, k_{\max}) = \frac{1}{\sum_{i = 1}^{|\Gamma (n)|} N_{i}} \sum_{j = 1}^{M} \left( C (P_{j}^{+}) + C (P_{j}^{-}) \right) \text{ [bit/base]} & (41) \end{matrix}$

The additional information Extra_CHS(n,k_max) for watermark extraction and restoration is the same as that of the non-circular histogram shifting method, Extra_CHS(n,k_max)=Extra_HS(n,k_max). As in the above-described methods, lossless compression is performed such that the additional information Extra_CHS(n,k_max) satisfies

$\rho \times {Extra}_{CHS} (n, k_{\max}) < \Phi \le \sum_{i = 1}^{|\Gamma (n)|} |D_{i}| .$

The circular histogram shifting method has the same additional information but higher watermark capacity, compared to the non-circular histogram shifting method.

The previous-section information of the code values shifted to a center value and the information on the number of embedded bits of the code values that belong to all regions except for the residual region are as follows.

$\begin{matrix} N_{E}^{CHS} = N \times \left[ p (x^{'} \in R) + \left( 1 - \sum_{t = R_{M}^{-} + 1}^{R_{M}^{+} - 1} p (t) \right) \times \lceil \log_{2} k_{\max} \rceil \right] \text{ [bit]} & (42) \end{matrix}$

Here, $\sum_{t = R_{M}^{-} + 1}^{R_{M}^{+} - 1} p (t)$ is the probability of belonging to the residual values, and R is the set of reference values R={R₁, R₂, . . . , R_{M−1}, R_M⁻, R_M⁺} of the sections. Thus, the bpn of the additional data is bpn_E^CHS=N_E^CHS/N_D [bit/base]. The capacity efficiency C^CHS, which is the ratio of the embedded data to the additional data, is C^CHS=N_W^CHS/N_E^CHS=bpn_W^CHS/bpn_E^CHS.

Although a preferred embodiment of the present invention has been described for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims.
