The invention is generally related to a method and system for embedding data covertly in a text document using space encoding.
Digital watermarking is a well-researched area in the signal processing community. Many techniques have been devised to hide information covertly in text and image documents. Hiding data is commonly termed “steganography” in the cryptography community. Steganography for text documents differs greatly from steganography for image documents, since modifying pixels in an image has much less visual effect than modifying pixels in text. Therefore, existing steganography techniques for image documents are not directly applicable to text documents.
Conventional methods for data hiding in text documents include dot encoding, space modulation (line shift coding, word shift coding), luminance modulation, halftone quantization, component manipulation and syntactic methods.
Conventional methods each have their own advantages and disadvantages. For example, dot encoding has high data hiding capacity but is typically vulnerable to printing and scanning of the text document because noise is introduced and interferes with decoding the dots. On the other hand, syntactic methods are resilient to printing and scanning but have low data capacity and are not self-verifiable.
There is an increasing need to prevent unauthorized disclosure of important information in text documents, especially in this knowledge-based era. There is also a need to discourage improper information disclosure by putting a track and trace mechanism in a printed text document. In case of information leakage, the source of leakage (person who printed the document) can be identified. There is also a need for data hiding with high capacity that is resilient to printing and scanning, accommodates a wide variety of text documents with little or no restrictions, and is self-verifiable.
An aspect of the invention is a method for embedding covert data in a text document, the method comprising providing the document having first and second characters; determining a horizontal space between the characters; altering the space to produce an altered space with a predetermined horizontal distance between the characters, wherein the altered space represents the embedded covert data; and formatting the document to produce a formatted document based on the altered space.
An aspect of the invention is a system for embedding covert data in a text document, the system comprising a data encoding processing device that receives the document having first and second characters, wherein the device includes a memory and a processor; the memory stores the document and a predetermined horizontal distance; and the processor determines a horizontal space between the characters, alters the space to produce an altered space with the predetermined horizontal distance between the characters, and formats the document to produce a formatted document based on the altered space, thereby embedding the embedded covert data in the document based on the altered space.
An aspect of the invention is a computer program product comprising a computer readable medium having computer program code means which, when loaded on a computer, makes the computer perform a method for embedding covert data in a text document, the method comprising providing the document having first and second characters; determining a horizontal space between the characters; altering the space to produce an altered space with a predetermined horizontal distance between the characters, wherein the altered space represents the embedded covert data; and formatting the document to produce a formatted document based on the altered space.
An aspect of the invention is a computer readable medium having a program recorded which, when loaded on a computer, makes the computer perform a method for embedding covert data in a text document, the method comprising providing the document having first and second characters; determining a horizontal space between the characters; altering the space to produce an altered space with a predetermined horizontal distance between the characters, wherein the altered space represents the embedded covert data; and formatting the document to produce a formatted document based on the altered space.
In embodiments, the document has multiple characters that include the first and second characters, and a space between each pair of the multiple characters that are horizontally adjacent to one another is altered to represent the embedded covert data. The document may have multiple characters that include the first and second characters, and a space between selected pairs of the multiple characters that are horizontally adjacent to one another is altered to represent the embedded covert data. The document may have multiple characters that include the first and second characters that form words, and a space between the words that are horizontally adjacent to one another is altered to represent the embedded covert data. The first character may be a left character relative to the second character, the second character is a right character relative to the first character, and the space is determined by a horizontal distance between a right-most point of the left character and a left-most point of the right character. The characters may be formed along a straight horizontal line, or along a curved horizontal line. The method may further comprise decoding the formatted document to reveal the embedded covert data based on the altered space. The embedded covert data may be a user name, a global identifier, or the like. The altered space may represent a binary sequence, and the binary sequence may be two bits, or the like. The space may be an inter-character space within a word, or an inter-word space between horizontally adjacent words. The space may be determined in pixels, and the altered space may be expressed in pixels. The space and the altered space may differ in horizontal distance by a single pixel. The characters in the formatted document may be visually apparent to a user, and a difference between the space and the altered space is essentially visually hidden from the user.
In the document and the formatted document, the characters may be visually apparent to a user, and a difference between the document and the formatted document is essentially visually hidden from the user.
In order that embodiments of the invention may be fully and more clearly understood by way of non-limitative examples, the following description is taken in conjunction with the accompanying drawings in which like reference numerals designate similar or corresponding elements, regions and portions, and in which:
Although shown as two separate computers, it will be appreciated that the data embedding encoder and decoder modules 138 and 158 may reside on the same computer. A transmission link 146 for transmitting the original document 32 to the data encoding processing device 132, and transmission links 148 and 166 for transmitting the formatted document 36 from the data encoding processing device 132 to the data decoding processing device 152, may be public or private networks, the Internet and the like. The documents 32 and 36 may be hardcopies and/or electronic versions. If the documents 32 and 36 are in hardcopy form, the documents 32 and 36 may be converted into electronic format by scanning and the like.
In this particular context, for a formatted text document, the term “inter-word space” refers to the horizontal space between horizontally adjacent words in a text row, for example, the horizontal space between the right-most point of the last character of the left word and the left-most point of the first character of the right word. Similarly, the horizontal space between horizontally adjacent characters is the distance between the right-most point of the left character and the left-most point of the horizontally adjacent right character. The term “inter-character space” of a word refers to the horizontal space between horizontally adjacent characters in that word. Lengths of inter-word and inter-character spaces may be determined and expressed in pixels.
The length L of the inter-word spaces of an original text row is calculated by:
$$L = \sum_{i=1}^{k} s_i$$
where, for a given i, si represents a particular inter-word space, i is a reference number indicating which space is referenced, and k represents the total number of inter-word spaces in the text row concerned.
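By way of non-limiting illustration, the definitions above can be sketched in Python. The sketch assumes each character is supplied as a hypothetical bounding box `(left, right)` in pixel columns, with `right` being the column just past the right-most inked pixel; the box format and the function names are illustrative assumptions and are not taken from the described embodiments.

```python
from typing import List, Tuple

Box = Tuple[int, int]  # hypothetical (left, right) pixel extent of one character


def inter_character_spaces(word: List[Box]) -> List[int]:
    """Gaps between horizontally adjacent characters inside one word."""
    return [nxt[0] - cur[1] for cur, nxt in zip(word, word[1:])]


def inter_word_spaces(row: List[List[Box]]) -> List[int]:
    """Gaps between the last character of a word and the first of the next."""
    return [right[0][0] - left[-1][1] for left, right in zip(row, row[1:])]


row = [
    [(0, 8), (11, 19)],    # first word: two characters, 3 px apart
    [(31, 39), (42, 50)],  # second word starts 12 px after the first ends
    [(60, 68)],            # third word starts 10 px after the second ends
]
spaces = inter_word_spaces(row)
total_length = sum(spaces)  # L, the summed inter-word space of the row
```

Here `spaces` holds the si values of the row and `total_length` is their sum L.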
In one particular embodiment, the inter-word space S=[s1, s2, s3 . . . s7, s8] is changed into S′=[s1′, s2′, s3′ . . . s7′, s8′] by modifying the inter-character space [c1, c2 . . . cn] of each word in the text row. For each word, each inter-character space ci is reduced by 1 pixel if ci>2 pixels. Hence, the overall inter-word space is increased such that for each si, si′≥si. By increasing the values of si′, the total length L′ of the new inter-word space satisfies the condition: L′≥L.
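The tightening step may be sketched as follows. The text does not say how the freed pixels are shared among the inter-word spaces, so this sketch makes an assumption: pixels freed inside a word are added to the inter-word space on its right (or on its left for the last word), which keeps the overall row width unchanged. The function name `tighten_row` and its argument layout are hypothetical.

```python
from typing import List, Tuple


def tighten_row(
    char_gaps: List[List[int]], word_gaps: List[int]
) -> Tuple[List[List[int]], List[int]]:
    """Shrink wide inter-character gaps by 1 px and widen inter-word gaps.

    char_gaps[w] holds the inter-character gaps of word w (n words);
    word_gaps holds the n - 1 inter-word gaps of the same row.
    Assumption (not from the source): pixels freed inside word w go to the
    inter-word gap on its right, or on its left for the last word.
    """
    if not word_gaps:
        # A single-word row has no inter-word space to absorb freed pixels.
        return [list(gaps) for gaps in char_gaps], []
    # Reduce each inter-character gap by 1 px only when it exceeds 2 px.
    new_char_gaps = [[c - 1 if c > 2 else c for c in gaps] for gaps in char_gaps]
    new_word_gaps = list(word_gaps)
    for w, (old, new) in enumerate(zip(char_gaps, new_char_gaps)):
        freed = sum(old) - sum(new)
        target = min(w, len(new_word_gaps) - 1)
        new_word_gaps[target] += freed
    return new_char_gaps, new_word_gaps
```

Under this assumption every si′ is at least si, and the sum of all gaps in the row is the same before and after the step.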
For convenience, the function Sign ([s1, s2 . . . sn]) is defined by:
Let smin=floor integer(average of the ε smallest values in [s1, s2 . . . sn]).
Sign([s1, s2 . . . sn])=g1|g2| . . . gn
where
The value ε is greater than or equal to the number of “−” gi selected.
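A minimal sketch of a Sign function of this kind is given below. The computation of smin follows the text; the rule that assigns each gi is an assumption made only for illustration (gi is "-" when si is less than or equal to smin and "+" otherwise), and ASCII "-" stands in for the "−" symbol. The name `sign` and the `epsilon` parameter are hypothetical.

```python
from typing import List


def sign(spaces: List[int], epsilon: int) -> str:
    """Return g1|g2|...|gn for a row of inter-word spaces.

    s_min is the floor of the average of the epsilon smallest spaces.
    Assumption (not confirmed by the source): g_i is "-" when
    s_i <= s_min and "+" otherwise.
    """
    if not 1 <= epsilon <= len(spaces):
        raise ValueError("epsilon must be between 1 and len(spaces)")
    smallest = sorted(spaces)[:epsilon]
    s_min = sum(smallest) // epsilon  # integer floor of the average
    return "|".join("-" if s <= s_min else "+" for s in spaces)
```

For example, with eight spaces alternating between 12 and 10 pixels and `epsilon` equal to 4, the four narrow spaces are marked "-" and the four wide spaces are marked "+".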
The data to be hidden is represented in binary form as a sequence of “1”s and “0”s.
In one particular embodiment, the inter-word space is set to S″=[s1″, s2″, s3″ . . . s7″, s8″] such that:
$$L'' = s_1'' + s_2'' + s_3'' + \dots + s_7'' + s_8''$$
$$L' = s_1' + s_2' + s_3' + \dots + s_7' + s_8'$$
$$L' = L''$$
[s1″, s2″, s3″ . . . s7″, s8″] satisfies the following condition:
To embed bits ‘00’: Sign(S″)=+|−|+|−|+|−|+|−
To embed bits ‘01’: Sign(S″)=−|−|+|+|−|−|+|+
To embed bits ‘10’: Sign(S″)=+|+|−|−|−|−|+|+
To embed bits ‘11’: Sign(S″)=−|−|+|+|+|+|−|−
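One way to construct such an S″ is sketched below. The four sign patterns are those listed above (with ASCII "-" standing in for "−"); everything else is an assumption made for illustration: each "-" space receives a common base width, each "+" space is at least `delta` pixels wider than that base, and any leftover pixels are spread over the "+" spaces so that the total length is preserved. The names `embed_bits` and `delta` are hypothetical.

```python
from typing import List

# Sign patterns for each two-bit value, as listed in the text
# (ASCII "-" stands in for the minus sign, separators omitted).
PATTERNS = {
    "00": "+-+-+-+-",
    "01": "--++--++",
    "10": "++----++",
    "11": "--++++--",
}


def embed_bits(s_prime: List[int], bits: str, delta: int = 2) -> List[int]:
    """Return S'' carrying `bits`, with sum(S'') == sum(S').

    Assumption (not from the source): every "-" space gets one base width
    and every "+" space is at least `delta` pixels wider than that base.
    """
    pattern = PATTERNS[bits]
    if len(s_prime) != len(pattern):
        raise ValueError("this sketch expects exactly eight inter-word spaces")
    total = sum(s_prime)
    plus = pattern.count("+")
    base = (total - plus * delta) // len(pattern)
    if base < 1:
        raise ValueError("row is too tight to embed data")
    spare = total - base * len(pattern) - plus * delta  # leftover pixels
    share, remainder = divmod(spare, plus)
    result = []
    seen_plus = 0
    for g in pattern:
        if g == "+":
            bonus = 1 if seen_plus < remainder else 0
            result.append(base + delta + share + bonus)
            seen_plus += 1
        else:
            result.append(base)
    return result
```

For a row whose S′ sums to 100 pixels, embedding bits ‘01’ under these assumptions gives narrow spaces of 11 pixels and wide spaces of 14 pixels, again summing to 100 pixels so that L′=L″.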
In order to encode data in text with different font sizes, and therefore different lengths of inter-word spacing, a scaling-invariant method can be used. Let S=[s1, s2, s3 . . . s7, s8] denote the inter-word spaces of a text row and F=[f1, f2, f3 . . . f7, f8], where each fi denotes the font size of the last character in the word before si.
First, S is normalized to form a scale invariant unit, V, by dividing each si by fi:
$$V = [v_1, v_2, v_3, \dots, v_7, v_8], \quad \text{where } v_i = s_i / f_i$$
After this, the same encoding method as described in an embodiment of the invention may be used over V.
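The normalization can be illustrated with a short sketch. The function name `normalize_spaces` is hypothetical, and the sketch assumes only that the font sizes are expressed in some consistent unit.

```python
from typing import List


def normalize_spaces(spaces: List[float], font_sizes: List[float]) -> List[float]:
    """Compute v_i = s_i / f_i for each inter-word space."""
    if len(spaces) != len(font_sizes):
        raise ValueError("need one font size per inter-word space")
    return [s / f for s, f in zip(spaces, font_sizes)]
```

For instance, spaces of [10, 5] at font size 10 and spaces of [20, 10] at font size 20 both normalize to [1.0, 0.5], so the same relative spacing is seen regardless of scale.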
Printing, scanning and copying may introduce geometric distortions, which may make data extraction difficult. A variety of techniques to reduce these geometric distortions are well known and continue to be developed. The invention is not limited to any of these techniques.
The system 10 decodes the embedded covert data in the formatted document 36. For example, using a horizontal profile of the text document as a reference point, the inter-word spaces are extracted. For each text row with an inter-word space, the Sign function described above computes the embedded “+” and “−”. With this and the encoding scheme, the hidden data is identified. In addition, the reference point can be determined using a vertical profile, horizontal profile and the like. Thus, it is not necessary to compare the original document 32 with the formatted document 36 having the embedded covert data in order to extract the embedded covert data from the formatted document 36. Other ways of determining the profile or reference point are possible; for example, optical character recognition (OCR) may be used to determine a bounding box for each word, from which the inter-word spaces are calculated to obtain the space profile.
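The final lookup step of decoding can be sketched as follows, assuming the Sign string of each text row has already been computed and is written with ASCII "-" in place of "−". The table is simply the inverse of the two-bit encoding scheme given earlier; skipping rows that match no pattern is an assumption of this sketch, and the name `decode_rows` is hypothetical.

```python
from typing import List

# Inverse of the two-bit encoding table given earlier in the text
# (ASCII "-" stands in for the minus sign).
BITS_FOR_SIGN = {
    "+|-|+|-|+|-|+|-": "00",
    "-|-|+|+|-|-|+|+": "01",
    "+|+|-|-|-|-|+|+": "10",
    "-|-|+|+|+|+|-|-": "11",
}


def decode_rows(row_signs: List[str]) -> str:
    """Concatenate the bits recovered from each row's Sign string.

    Rows whose Sign string matches no pattern are skipped; that policy is
    an assumption of this sketch, not something the source specifies.
    """
    return "".join(BITS_FOR_SIGN.get(signs, "") for signs in row_signs)
```

Because only the Sign strings of the formatted document are needed, no comparison against the original document is involved.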
In an embodiment, the process for determining profile is:
where W is the width of the image l(i, j).
where H denotes the height of the strip S(i, j).
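As a rough illustration only, a profile of this kind can be approximated by a standard projection profile: ink is counted along each pixel row of the image to locate text rows, and along each pixel column of a text strip to locate gaps. This is an assumption; the formulas of the embodiment are not reproduced here. The binary image layout (1 for ink, 0 for background) and all function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # hypothetical binary image: 1 = ink, 0 = background


def row_profile(image: Image) -> List[int]:
    """Ink count of each pixel row (summing over the image width W)."""
    return [sum(row) for row in image]


def column_profile(strip: Image) -> List[int]:
    """Ink count of each pixel column (summing over the strip height H)."""
    return [sum(col) for col in zip(*strip)]


def blank_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """(start, length) of each run of zeros lying between two inked runs."""
    runs = []
    start = None
    for idx, value in enumerate(profile):
        if value == 0 and start is None:
            start = idx
        elif value != 0 and start is not None:
            if start > 0:  # ignore the left margin
                runs.append((start, idx - start))
            start = None
    return runs  # a trailing blank run (right margin) is never recorded
```

The blank runs of a strip's column profile give candidate gap widths in pixels; separating inter-character gaps from inter-word gaps would need a further step that this sketch does not attempt.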
For encoding the data, preferably there is a minimum of two words in each text row, and the data capacity is proportional to the text information in the document since the robustness depends on the length of each sentence.
The invention is applicable to various text documents such as transcripts, diplomas, certificates and the like in the academic field; shares and bonds certificates, insurance policies, statements of account, letters of credit, legal forms and the like in the financial field; immigration visas, titles, financial instruments, contracts, licenses and permits, classified documents and the like in the government field; prescriptions, control chain management, medical forms, vital records, printed patient information and the like in the health care field; schematics, cross-border trade documents, internal memos, business plans, proposals, designs and the like in the business field; tickets, postage stamps, manuals and books, coupons, gift certificates, receipts, and the like in the consumer field; and many other applications and fields.
Thus, a method and system for embedding covert data in a text document using space encoding is disclosed where the space encoding changes the inter-word spacing and/or inter-character spacing within a text row to a particular format such that the data is essentially visually hidden in the text document.
While embodiments of the invention have been described and illustrated, it will be understood by those skilled in the technology concerned that many variations or modifications in details of design or construction may be made without departing from the invention.
Number | Date | Country | Kind |
---|---|---|---|
200802187-5 | Mar 2008 | SG | national |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/SG2009/000091 | 3/17/2009 | WO | 00 | 9/17/2010 |