This is the first application filed for the instantly disclosed technology.
The present application relates to a system and method for digital watermarking of textual data.
With the rapid growth of data, and the sharing of data between parties, interest has grown in watermarking systems that can be used to provide copyright protection of data, prove ownership of the data, and trace the source of data leakage. Digital watermarking typically involves changing existing data or adding additional data to digital content in a covert manner that allows a copy of the digital content to be traced to a source and/or authenticated. However, data analysis, whether by traditional statistical approaches or by more recent machine learning based approaches, requires high integrity of the data because modifications to the data may lead to significantly different analysis results. Accordingly, any digital watermarking applied should be done in a manner that preserves the integrity of the data for analytical purposes.
Existing research conducted on digital watermarking of textual data mostly focuses on file-level watermarks. File-level watermarking involves watermarking an entire file (e.g., a document file or a worksheet file) before transferring the file to another party. File-level watermarking is vulnerable to subset attacks in which a subset of the original data set may be copied or leaked without damaging the file-level watermark.
In the case of unstructured textual data, syntactic and semantic digital watermarking, which exploits the syntactic or semantic structure of sentences, can be applied. However, both syntactic and semantic watermarking techniques are more suitable for unstructured text (e.g., sentence based text) because they rely on context to analyze the syntactic or semantic structure of the text. For structured textual data, syntactic and semantic techniques are not practical because a database table may contain groups of alphanumeric values that do not have interconnections to each other.
In the case of structured textual data where data is arranged in units such as cells of tabular data, cell-level digital watermarks have been proposed. In cell-level watermarking, a watermark is embedded into the structured data cells, thus making it difficult to copy and leak subsets of the data in a manner that is untraceable. Database watermarking techniques for structured textual data have been proposed that rely on techniques such as Least Significant Bits (LSB), content analysis, and partitioning to embed and extract a watermark that is resilient to typical watermark attacks such as modification, subset, and re-typing attacks. However, known solutions depend on a primary key attribute of the database table in a partitioning algorithm. The primary key attribute is often used to compute a partition number that represents the partition assigned to a given cell. This reliance makes such solutions vulnerable to deletion or alteration attacks because it is not difficult to identify and delete the primary key column. A deletion attack involves an attacker deleting part of the data. An alteration attack involves modifying some records of the data to destroy or remove the watermark.
Additionally, existing structured textual data watermarking solutions can substantially alter the overall content of the subject data.
Accordingly, it is desirable to provide an improved system and method for digital watermarking of structured textual data.
In accordance with an aspect of the present disclosure, there is provided a method for embedding a digital watermark in structured textual data arranged in a table of cells that each contain a respective set of textual data characters. The method comprises selecting a first subset of the cells for watermarking. For each of the cells in the first subset of cells the method includes determining a primary cell key for the cell based on one or more of the textual data characters contained in the cell, determining a cell partition number for the cell based on the primary cell key, and embedding a portion of a first digital watermark ID code at an embedding position within the cell, the portion being determined based on the cell partition number.
In accordance with the previous aspect, the primary cell key is determined based on a combination of at least one of the textual data characters contained in the cell and the number of textual data characters contained in the cell.
In accordance with any of the preceding aspects, the cell partition number for the cell is determined also based on a secret key that is common for all of the cells in the first subset of cells.
In accordance with any of the preceding aspects, the embedding position within the cell is determined based on the secret key and the length of the textual data in the cell.
In accordance with any of the preceding aspects, the cells are arranged in an array of columns and rows, wherein selecting a first subset of the cells for watermarking comprises selecting a first subset of rows of cells of the array, wherein each of the cells in the rows of the selected subset is included in the first subset of the cells.
In accordance with any of the preceding aspects, the first digital watermark ID code is comprised of a plurality of visible characters, the portion of the first digital watermark ID code comprises at least one of the plurality of visible characters, and embedding the portion of the first digital watermark ID code comprises replacing a portion of the textual data characters contained in the cell with the portion of the first digital watermark ID code.
In accordance with the previous aspect, for each of the cells in the first subset of cells, the method further comprises replacing the last character of the textual data characters in the cell with a noise key character selected based on the cell partition number from a noise key index that is common for all of the cells in the first subset of cells.
In accordance with some of the preceding aspects, the first digital watermark ID code is comprised of a plurality of invisible characters, the portion of the first digital watermark ID code comprises at least one of the plurality of invisible characters, and embedding the portion of the first digital watermark ID code comprises inserting the portion of the first digital watermark ID code into the textual data characters contained in the cell.
In accordance with any of the preceding aspects, the method further comprises selecting a second subset of the cells for watermarking. For each of the cells in the second subset of cells, the method further includes determining a primary cell key for the cell based on one or more of the textual data characters contained in the cell; determining a cell partition number for the cell based on the primary cell key; embedding a portion of a second digital watermark ID code at an embedding position within the cell, the portion being determined based on the cell partition number. The first digital watermark ID code and the second digital watermark ID code each map to a same authorized recipient identifier.
In accordance with any of the preceding aspects, the method further comprises inserting a noise column in the table, the noise column comprising a plurality of cells each containing the first digital watermark ID code in encrypted form.
In accordance with another aspect of the present disclosure, there is provided a method for extracting digital watermark information from textual data that is arranged in cells that each contain a respective set of textual data characters. The method comprises fetching a cell from the cells of the textual data, determining that the cell contains a portion of a digital watermark ID code embedded therein, determining a primary cell key for the cell based on one or more of the textual data characters contained in the cell, determining a cell partition number for the cell based on the primary cell key, extracting a portion of a first digital watermark ID code at an embedding position within the cell, the portion being determined based on the cell partition number, and repeating these steps for other cells until the digital watermark ID code is fully extracted.
In accordance with the preceding aspect, the first digital watermark ID code is comprised of a plurality of visible characters. In this example, determining that the cell contains the portion of the digital watermark ID code embedded therein comprises locating a noise key character at a predetermined position, the noise key character being selected, based on the cell partition number, from a noise key index that is common for all of the cells.
In accordance with some of the preceding aspects, the first digital watermark ID code is comprised of a plurality of invisible characters and the step of determining that the cell contains the portion of the digital watermark ID code embedded therein comprises locating the portion of the first digital watermark ID code corresponding to the cell partition number at the embedding position.
In accordance with any of the preceding aspects, the primary cell key is determined based on a combination of at least one of the textual data characters contained in the cell and the number of textual characters contained in the cell.
In accordance with any of the preceding aspects, the cell partition number is determined also based on a secret key that is common to all of the cells.
In accordance with any of the preceding aspects, the embedding position within the cell is determined based on the secret key and the length of the textual data in the cell.
In accordance with any of the preceding aspects, the method for extracting digital watermark information from textual data further comprises locating a noise column in the table, the noise column comprising a plurality of cells each containing the first digital watermark ID code in encrypted form, and decrypting the encrypted form to extract the first digital watermark ID code.
In another aspect of the present disclosure, there is provided a computer system comprising a processor and a non-transitory memory coupled to the processor, the memory storing instructions that, when executed by the processor, configure the computer system to perform the method of any one of the preceding aspects.
In yet another aspect of the present disclosure, there is provided a computer program product comprising a non-transitory computer medium storing instructions for configuring a computer system to perform the method of any one of the preceding aspects.
The disclosed watermarking systems and methods, in at least some applications, provide one or more of the following features: preserve the usability of the watermarked data for advanced data analytics by one or both of modifying only negligible amounts of the original data and/or embedding only noise into the original data; enable a large number of unique watermarks; remove dependence on any primary key, thereby providing resistance to a primary key deletion attack; and enable blind extraction of the digital watermarks such that the original data is not required for extraction.
Reference will now be made, by way of example, to the accompanying figures which show example embodiments of the present application, and in which:
Like reference numerals are used throughout the Figures to denote similar elements and features. Though aspects of the invention will be described in conjunction with the illustrated embodiments, it will be understood that it is not intended to limit the invention to such embodiments.
The present disclosure teaches methods and systems for digital watermarking of structured textual data.
Example embodiments are disclosed herein that provide methods and systems for watermarking structured textual data to enable one or more of data leakage traceability, copyright protection and source authentication during the lifecycle of the data. As will be described in detail below, the disclosed watermarking systems and methods are configured to, in at least some applications, provide one or more of the following features: preserve the usability of the watermarked data for advanced data analytics by one or both of modifying only negligible amounts of the original data and/or embedding only noise into the original data; enable a large number of unique watermarks; remove dependence on any primary key, thereby providing resistance to a primary key deletion attack; and enable blind extraction of the digital watermarks such that the original data is not required for extraction.
Embedding System and Process
By way of example,
As used here, an “engine” can refer to a hardware processing circuit and machine-readable instructions (software and/or firmware) executable on the hardware processing circuit. A hardware processing circuit can include any or some combination of a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, a digital signal processor, or another hardware processing circuit. Alternatively, an “engine” can refer to a combination of a hardware processing circuit. In example embodiments, each of the engines 102, 104 and 106 may be implemented by respective machine-readable instructions executing on a common hardware processing circuit.
In at least some example embodiments, the digital watermark information 108-1 to 108-M that is embedded in each watermarked data version 2000-1 to 2000-M can be mapped to a unique recipient, for example to an intended or authorized recipient, of the watermarked data version 2000-z (where z represents a generic one of the watermarked data versions 2000-1 to 2000-M). For example, in
NC watermark ID codes 100-1 to 100-M are made up of multiple characters (e.g., “m” characters) that can be visibly rendered (e.g., non-zero width characters that take up a display space) on a display output or print output. In an example embodiment, each NC watermark ID code 100-1 to 100-M is 8 characters in length (e.g., m=8), with each character being selected from the lowercase English-language alphabet visible character set {a, . . . , z}. Accordingly, each character has 26 possible values, providing 26^8 (approximately 2.09×10^11) possible unique NC watermark ID codes that can each be mapped to a respective authorized recipient. In various example configurations, the character length used for NC watermark ID codes may alternatively be less than or greater than 8 characters, and the set of noise characters from which the NC watermark ID codes are selected may include other visible characters instead of, or in addition to, the lowercase English-language alphabet visible character set {a, . . . , z}. In some examples, the type of characters used for NC watermark ID codes may be selected based on the type of data that is being embedded with the NC watermark ID codes. For example, numeric codes can be used in the case of numeric data to better blend in with the remaining data.
IC watermark ID codes 120-1 to 120-M are each made up of multiple hidden or invisible characters that will not be visibly rendered on a display output or print output. For example, such characters can include zero-width control characters that are typically used by word processing applications to wrap lines, break paragraphs, and space words in a specific way, but which have no meaning in the context of the database table 180. The invisible characters take up storage space within the textual data, but when the textual data is rendered the invisible characters are “zero-width” characters that are not visible. In the illustrated example, each IC watermark ID code 120-1 to 120-M includes the same number of characters as each NC watermark ID code (e.g., “m” characters), however in alternative configurations the NC watermark ID codes can include a different number of characters than the IC watermark ID codes. In the depicted example, each character of an IC watermark ID code 120-1 to 120-M is selected from a predefined set of invisible characters. In an illustrative embodiment the set of invisible characters may for example include 5 possible characters, represented as {c1, c2, c3, c4, c5}. Accordingly, each invisible character of the IC watermark ID code can take 5 possible values. Therefore, in the illustrated example where m=8, there are 5^8 (i.e., 390,625) possible unique values for the IC watermark ID codes 120-1 to 120-M.
In various example configurations, the number of possibilities can be increased by increasing the character length of the IC watermark ID code and/or increasing the number of invisible characters in the set from which the characters are selected. Similarly, the number of possibilities can be decreased by reducing the character length of the IC watermark ID code and/or reducing the number of invisible characters in the set from which the characters are selected.
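By way of non-limiting illustration, the relationship between character set size, code length and the number of available watermark ID codes can be sketched as follows. The function name `code_space` is hypothetical and is used here for illustration only.

```python
def code_space(alphabet_size: int, code_length: int) -> int:
    """Number of distinct ID codes of `code_length` characters drawn from
    a character set containing `alphabet_size` characters."""
    return alphabet_size ** code_length


# Example configurations described above (m = 8).
nc_codes = code_space(26, 8)  # visible characters {a, ..., z}
ic_codes = code_space(5, 8)   # invisible characters {c1, ..., c5}

# Lengthening the code or enlarging the character set increases the count.
ic_codes_longer = code_space(5, 10)
ic_codes_bigger_set = code_space(7, 8)

print(nc_codes, ic_codes, ic_codes_longer, ic_codes_bigger_set)
```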
Recipients are identified in watermark database 92 by mapping respective pairs of NC and IC watermark ID codes 100-1, 120-1 to 100-M, 120-M to respective email addresses that are used as authorized recipient identifiers 98-1 to 98-M. For example, in watermark database 92 of
As shown in
In example embodiments, structured textual data 1000 can be arranged as tabular data 180 as shown in
The generation of a single watermarked data version 2000-1 that includes a database table 190 embedded with watermark information 108-1 corresponding to a single authorized recipient (for example Bob@companyA.com as indicated by authorized recipient identifier 98-1) will now be described. In this regard, the actions performed by row-noise character embedding engine 102, invisible character embedding engine 104 and column noise embedding engine 106 on structured textual data 1000 to generate watermarked digital data version 2000-1 are as follows.
Row-Noise Character Embedding
Referring to
As indicated by block 310 in
As indicated in
An illustrative example of sub-process 350 for embedding a noise character into a specific cell 186 (e.g., Cij) is shown in greater detail in
In example embodiments, in order to facilitate selection of a portion of NC watermark ID code 100-1 (e.g., a noise character such as character “j” in the illustrated example) for embedding in a specific cell 186, NC watermark ID code 100-1 is divided into partitions. In particular, the character locations of NC watermark ID code 100-1 are partitioned into portions or subsets that each include a defined number of character locations, with each subset being assigned a successive partition number 353. In the illustrated embodiment, the character locations of NC watermark ID code 100-1 are partitioned into subsets where the defined number of character locations per subset is one. In the illustrated example, each partition includes only a single character from the NC watermark ID code 100-1, thus each partition number 353 indexes a respective noise character for embedding into a cell 186. In other example configurations, other subset sizes could be used in embodiments where more than a single character from NC watermark ID code 100-1 is to be embedded in each cell.
Each partitioned subset of the NC watermark ID code 100-1 (e.g., each character location in the illustrated example where subset size=1) is assigned a respective partition number 353. Accordingly, in the illustrated example, the first character location (e.g., location of text character “j”) in NC watermark ID code 100-1 is assigned a partition number equal to 0 (Partition0), the second character location (e.g., location of text character “n”) is assigned a partition number equal to 1 (Partition1), and so on, with the mth character location (e.g., location of text character “d”) being assigned a partition number equal to m−1 (e.g., Partition7 in the illustrated case where m=8).
As indicated in block 354, the row-noise character embedding engine 102 selects content from the NC watermark ID code 100-1 to embed in the subject cell 186 (Cij). In example embodiments, this selection is done by assigning a cell partition number to the subject cell 186 (Cij), and then selecting the text character(s) from the location(s) of NC watermark ID code 100-1 that have been assigned the same partition number. In example embodiments, the cell partition number assigned to the subject cell 186 (Cij) is determined based on content of the subject cell 186 (Cij). In the illustrated example (i.e. the case where a single character from the NC watermark ID code 100-1 is embedded into the subject cell 186 (Cij)), the following equation provides one example of how a cell partition number can be assigned to the subject cell 186 (Cij):
Partition(Cij) = H(ks ∥ H(Pij ∥ ks)) mod m   (I)
Where: H(x) is a hash function; Pij is a primary key for the cell Cij; ks is the secret key 160 for the structured data 1000; ∥ denotes concatenation; and m is the number of partitions that the NC watermark ID code 100-1 has been divided into (e.g. the number of characters of NC watermark ID code 100-1 in the illustrated example).
In example embodiments, the primary key Pij for cell Cij is determined based on the content of cell Cij. In the illustrated example, the cell primary key Pij is based on the first character of the data included in cell Cij and the length of cell Cij. In a particular example, cell primary key Pij can be a concatenation of the first character of the data of cell Cij and the length “L” (e.g., number of characters) of the textual data of cell Cij. For example, where the first character of data is a “T” and the length or the textual data contained in cell Cij is L=14, the primary key Pij can be the character string “T14”. In some examples, the cell primary key Pij can be based on other properties and/or character locations of the data included in the cell, so long as the cell primary key Pij can be determined at a future watermark extraction time.
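By way of non-limiting illustration, the concatenation-based cell primary key described above can be sketched as follows. The function name `primary_cell_key` and the sample cell content are hypothetical.

```python
def primary_cell_key(cell_text: str) -> str:
    """Build the cell primary key Pij as the first character of the cell
    data concatenated with the cell length L (number of characters)."""
    if not cell_text:
        raise ValueError("an empty cell has no first character")
    return f"{cell_text[0]}{len(cell_text)}"


# A 14-character cell whose first character is "T" yields the key "T14".
key = primary_cell_key("Toronto Canada")
print(key)
```

Because the key depends only on the content of the cell itself, it can be recomputed at extraction time without reference to any primary key column of the table.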
A concatenation of the cell primary key Pij and the secret key ks is then provided to hash function H(x). The hash function H(Pij∥ks) returns a first numerical hash value. The first numerical hash value returned by H(Pij∥ks) is concatenated with the secret key 160 (ks) and provided to another instance of the hash function H(x), which returns a second numerical hash value. A modulo operation is performed to return a cell partition number for the cell Cij (denoted “Partition(Cij)”) that is the remainder value (e.g., a value between 0 and m−1) of the second hash value divided by the number of partitions (m). For example, if m=8, the partition number, or Partition(Cij), is a value between 0 and 7. As discussed above, the number of partitions m is the number of characters in the digital watermark (W).
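By way of non-limiting illustration, the two-stage hashing and modulo operation described above can be sketched as follows. The disclosure does not name a particular hash function; SHA-256 is used here purely as an assumption, as are the integer form of the secret key, the string-based concatenation, the order of the outer concatenation, and the function names `H` and `cell_partition`.

```python
import hashlib


def H(data: str) -> int:
    """Stand-in for hash function H(x): SHA-256 digest read as an integer.
    The choice of SHA-256 is an assumption, not taken from the disclosure."""
    return int.from_bytes(hashlib.sha256(data.encode("utf-8")).digest(), "big")


def cell_partition(cell_text: str, secret_key: int, m: int) -> int:
    """Cell partition number: second hash value modulo the number of partitions."""
    # Primary cell key Pij: first character concatenated with the cell length.
    primary_key = f"{cell_text[0]}{len(cell_text)}"
    # First hash value: H(Pij || ks).
    first_hash = H(f"{primary_key}{secret_key}")
    # Second hash value: the first hash value concatenated with ks, hashed again.
    second_hash = H(f"{secret_key}{first_hash}")
    # Remainder is a value between 0 and m - 1.
    return second_hash % m


partition = cell_partition("Toronto Canada", secret_key=1234567, m=8)
print(partition)
```

Under these assumptions, two cells that share the same first character and the same length are assigned the same partition number, since both yield the same primary cell key.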
The row-noise character embedding engine 102 selects the noise character at the partition location of NC watermark ID code 100-1 that corresponds to the cell partition number (Partition(Cij)) for embedding in content of cell Cij. For example,
Referring again to
The position at which the selected noise character is embedded in a cell Cij is determined based on the following equation:
EmbeddingPosition(Cij) = ks mod L   (II)
Where ks is the secret key 160 described above and L is the length (e.g., number of textual characters) of the data contained in cell Cij. The embedding position is determined by applying a modulo operation to determine the remainder of the secret key ks divided by the length L. The resulting remainder value is a character position that is between 0 and (L−1).
As noted above, the first and final characters (e.g. character locations 0 and L−1) of the original textual data 192 of cell Cij are reserved and not available for embedding of the noise character. Accordingly, if the equation (II) returns an embedding position of 0, the embedding position used is position 1 (e.g., the location of the second character of the data contained in cell Cij), and if the equation (II) returns an embedding position of L−1, the embedding position used is position L−2 (e.g., the penultimate character location in cell Cij).
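By way of non-limiting illustration, the embedding position computation and the handling of the reserved first and final character locations can be sketched as follows. The function name `embedding_position`, the integer form of the secret key, and the minimum cell length check are assumptions made for illustration.

```python
def embedding_position(secret_key: int, length: int) -> int:
    """Remainder of the secret key divided by the cell length L, moved off
    the reserved first (0) and final (L - 1) character locations."""
    if length < 3:
        # Assumption: a cell needs at least one non-reserved character location.
        raise ValueError("cell is too short to hold an embedded character")
    position = secret_key % length
    if position == 0:
        return 1            # use the second character location instead
    if position == length - 1:
        return length - 2   # use the penultimate character location instead
    return position


print(embedding_position(33, 14))  # 33 mod 14 = 5, not reserved
print(embedding_position(28, 14))  # 28 mod 14 = 0, moved to 1
print(embedding_position(27, 14))  # 27 mod 14 = 13, moved to 12
```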
For illustrative purposes, in the example illustrated in
At block 364, the last character of the data contained in cell Cij is replaced with a noise key character selected from the NKI 140. Similar to the NC watermark ID code, the m character locations of NKI 140 are also divided into partitions 0 to m−1. The cell partition number determined for cell Cij in block 354 using equation (I) is used again in block 364 to select the noise key character that is located at the partition location in NKI 140 that corresponds to the cell partition number determined for cell Cij. In the illustrated embodiment, where the cell partition number is Partition0, the first noise key character “r” is selected from NKI 140, such that, as shown in
Although the first and last character locations of cell Cij have been reserved as non-embeddable positions in the presently described example, in other embodiments different locations could be reserved instead of or in addition to such locations.
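By way of non-limiting illustration, the per-cell row-noise embedding steps described above (partition selection, embedding position determination, noise character replacement and noise key character replacement) can be sketched together as follows. SHA-256 as the hash function, the integer secret key, and the names `embed_row_noise`, `NC_CODE` and `NKI` are assumptions; the sample code values are hypothetical apart from the individual characters mentioned in the description.

```python
import hashlib


def _h(text: str) -> int:
    """Stand-in for H(x): SHA-256 digest read as an integer (an assumption)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")


def embed_row_noise(cell_text: str, nc_code: str, nki: str, secret_key: int):
    """Return (watermarked_text, partition, position) for a single cell."""
    length = len(cell_text)
    m = len(nc_code)
    if length < 3 or len(nki) != m:
        raise ValueError("cell too short, or NKI length differs from code length")
    # Cell partition number from the content-derived primary cell key.
    primary_key = f"{cell_text[0]}{length}"
    partition = _h(f"{secret_key}{_h(f'{primary_key}{secret_key}')}") % m
    # Embedding position, kept off the reserved first and last locations.
    position = min(max(secret_key % length, 1), length - 2)
    chars = list(cell_text)
    chars[position] = nc_code[partition]  # noise character replaces one character
    chars[-1] = nki[partition]            # noise key character replaces the last one
    return "".join(chars), partition, position


NC_CODE = "jnxqzkwd"  # hypothetical; only "j", "n" and the final "d" appear in the text
NKI = "rtmbvcye"      # hypothetical; only the leading "r" appears in the text
marked, partition, position = embed_row_noise("Toronto Canada", NC_CODE, NKI, 1234567)
print(repr(marked), partition, position)
```

In this sketch neither the first character nor the length of the cell is changed by the embedding, so the same primary cell key, and therefore the same cell partition number, can be recomputed from the watermarked cell.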
Referring again to
Invisible Character Embedding
Referring to
Similar to the embedding process 300 performed by row-noise embedding engine 102, invisible character embedding engine 104 is also configured to select a subset of rows 182 located throughout table 180 for IC watermarking (Block 410 in
As indicated in
An illustrative example of sub-process 450 for embedding an invisible character into a specific cell 186 is shown in greater detail in
In example embodiments, the character locations of IC watermark ID code 120-1 are partitioned and assigned respective partition numbers in a manner identical to that described above in respect of NC watermark ID code 100-1. In particular, each character location of IC watermark ID code 120-1 is assigned a respective partition number 453, with the first invisible character location (e.g., location of invisible text character “c2”) in IC watermark ID code 120-1 being assigned a partition number equal to 0 (Partition0), the second invisible character location (e.g., location of first occurrence of invisible character “c4”) being assigned a partition number equal to 1 (Partition1), and so on, with the mth invisible character location (e.g., location of final invisible character “c5”) being assigned a partition number equal to m−1 (e.g., Partition7 in the illustrated case where m=8).
As indicated in block 454, the invisible character embedding engine 104 selects content from the IC watermark ID code 120-1 to embed. In example embodiments, this selection is done in the same manner as described above in respect of row-noise embedding. Namely, a cell partition number is assigned to the subject cell 186 (Ci′j′) based on the data content of the cell using the above equation (I). The invisible character embedding engine 104 selects the invisible character at the partition location of IC watermark ID code 120-1 that corresponds to the cell partition number (Partition(Ci′j′)) for embedding in content of cell Ci′j′. For the present illustrative example, let the partition number assigned to cell Ci′j′ be “Partition3”. Accordingly, the fourth character of IC watermark ID code 120-1 (e.g., character “c4” at Partition3) will be selected for embedding in cell Ci′j′.
As indicated in block 456, the invisible character embedding engine 104 determines an embedding position within the data included in cell Ci′j′ for the selected invisible character (e.g., invisible character “c4” in the currently described example). In example embodiments, the embedding position can be determined in the same manner as described above (block 356,
For illustrative purposes, in the example illustrated in
Referring again to
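By way of non-limiting illustration, the per-cell invisible character embedding can be sketched as follows. The five Unicode zero-width code points used for {c1, ..., c5}, SHA-256 as the hash function, the integer secret key, the reuse of the reserved-position rule from row-noise embedding, and the names `embed_invisible` and `IC_CODE` are all assumptions made for illustration.

```python
import hashlib

# Assumed invisible character set {c1, ..., c5}: five Unicode zero-width code points.
INVISIBLE = ("\u200b", "\u200c", "\u200d", "\u2060", "\ufeff")
C1, C2, C3, C4, C5 = INVISIBLE
# Hypothetical IC code; only c2 (first), c4 (second) and c5 (last) appear in the text.
IC_CODE = (C2, C4, C1, C4, C3, C1, C2, C5)


def _h(text: str) -> int:
    """Stand-in for H(x): SHA-256 digest read as an integer (an assumption)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")


def embed_invisible(cell_text: str, ic_code, secret_key: int):
    """Insert one invisible character; return (text, partition, position)."""
    length = len(cell_text)
    if length < 3:
        raise ValueError("cell is too short to hold an embedded character")
    primary_key = f"{cell_text[0]}{length}"
    partition = _h(f"{secret_key}{_h(f'{primary_key}{secret_key}')}") % len(ic_code)
    position = min(max(secret_key % length, 1), length - 2)
    # Insertion (not replacement): the original characters are all preserved.
    marked = cell_text[:position] + ic_code[partition] + cell_text[position:]
    return marked, partition, position


marked, partition, position = embed_invisible("Toronto Canada", IC_CODE, 1234567)
print(len(marked), partition, position)
```

Since the invisible character is inserted rather than substituted, the watermarked cell is one character longer than the original, and the original content can be recovered by removing the inserted character.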
Column Noise Embedding
Referring to
Column-noise character embedding involves inserting an extra column into the tabular data 180. The extra column comprises a column of cells that each store watermarking information that can appear as noise to an observer. In an example embodiment, the “noise column” is given a header name (e.g., field label) selected from a set of predefined header names 501. In example embodiments, the set of predefined header names 501 is stored as part of watermark database 92. In another embodiment, the noise column is given a header name based or modeled on the header names of existing columns in the table, such that the noise column is difficult for an attacker to identify and delete.
At step 550, the NC watermark ID code 100-1 contained in watermark field 544 is passed to an encryption function, while the start signal 542 and the check signal 546 are left unchanged. The encryption function generates an encrypted watermark value 552 from the NC watermark ID code 100-1. At step 555, the start signal 542, the encrypted watermark value 552, and the check signal 546 are concatenated and the resulting string is encoded using a Base64 encoder. The resulting obfuscated value 558 is stored in the cell CN. Finally at step 560, in some examples some decoration characters may be added based on predetermined modification rules to the obfuscated value 558. For example, the obfuscated watermark may be split up by inserting dashes every few characters, to produce a decorated obfuscated value 562. In example embodiments, the set of decoration characters and associated modification rules are also stored in the watermark database 92.
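By way of non-limiting illustration, the encrypt, concatenate, Base64-encode and decorate sequence described above can be sketched as follows, together with an inverse used only to check the round trip. The disclosure does not specify the encryption function, the signal values or the decoration rule; the XOR keystream below is a toy stand-in that is not suitable for real use, and `START`, `CHECK`, the group size and all function names are hypothetical.

```python
import base64
import hashlib

START, CHECK = "ST", "CK"  # hypothetical start and check signal values


def _keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from the key (toy construction)."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]


def toy_encrypt(plaintext: str, key: bytes) -> str:
    """Toy stand-in for the unspecified encryption function; returns hex text."""
    data = plaintext.encode("utf-8")
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data)))).hex()


def toy_decrypt(cipher_hex: str, key: bytes) -> str:
    data = bytes.fromhex(cipher_hex)
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data)))).decode("utf-8")


def obfuscate(nc_code: str, key: bytes, group: int = 4) -> str:
    """Encrypt the watermark field, wrap it in the signals, Base64-encode,
    then decorate by inserting a dash after every `group` characters."""
    wrapped = START + toy_encrypt(nc_code, key) + CHECK
    encoded = base64.b64encode(wrapped.encode("ascii")).decode("ascii")
    return "-".join(encoded[i:i + group] for i in range(0, len(encoded), group))


def deobfuscate(value: str, key: bytes) -> str:
    """Inverse of obfuscate(): strip decoration, decode, verify signals, decrypt."""
    wrapped = base64.b64decode(value.replace("-", "")).decode("ascii")
    if not (wrapped.startswith(START) and wrapped.endswith(CHECK)):
        raise ValueError("start or check signal not found")
    return toy_decrypt(wrapped[len(START):-len(CHECK)], key)


noise_cell = obfuscate("jnxqzkwd", b"demo-key")
print(noise_cell)
print(deobfuscate(noise_cell, b"demo-key"))
```

The dash is a convenient decoration character in this sketch because it is not part of the standard Base64 alphabet, so it can be stripped unambiguously before decoding.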
Watermarked Data Version
The respective sub-processes described above in respect of row-noise character embedding engine 102, invisible character embedding engine 104 and column noise embedding engine 106 on structured textual data 1000 generate watermarked digital data version 2000-1 that includes embedded digital watermark information 108-1. In the described embodiment, digital watermark information 108-1 includes three types of digital watermarks, namely: row-noise watermarking applied to a first set of scattered rows 182 of tabular data 180; invisible character watermarking applied to a second set of scattered rows 182 of tabular data 180; and column-noise watermarking applied to a column of the tabular data. In example embodiments, all three of these types of digital watermarks independently embed information that maps to authorized recipient identifier 98-1. In some examples the order of applying the three different types of digital watermarks can be varied from that described above. Furthermore, one or two of the digital watermark types may be omitted in some example applications.
Extraction System and Process
If a watermarked data version is illegally copied (in whole or part) or leaked, an extraction process can be carried out on the copied or leaked data to extract one or both of the NC watermark ID code and/or IC watermark ID code, which will map to a specific authorized recipient of watermarked data version 2000-z (where 1<=z<=M).
By way of example,
Column Noise Extraction
As indicated in
As indicated at Block 630, the noise column is identified. In the illustrated example, the column header name of each of the columns 184 of tabular data 180 included in the watermarked data version 2000 is compared with the header names in the set of predefined header names 501 to identify a match and thereby identify the noise column. Next, a sub-process 650 is executed on the cells CN of the noise column to de-obfuscate and extract the NC watermark ID code 100-z from the cells CN of the identified noise column. The steps of sub-process 650 are described in detail with reference to
As indicated in block 660 of
Invisible Character Extraction
Character extraction sub-process 700 (shown in
In the illustrated embodiment, in the invisible character embedding sub-process 400 the invisible characters were embedded in the table 180 in a row-wise manner. Accordingly, if a particular row 182 of the table 180 has any invisible characters embedded therein, then all cells in that row will also have invisible characters embedded therein. Therefore, in the illustrated example, sub-process 700 starts at the first row 182 and first column 184 of the table 180. If that cell has an invisible character at the expected embedding positon, then the invisible watermark character corresponding to the partition number of that cell is extracted. Furthermore, other cells in the same row 182 are also checked for an invisible watermark character at their respective embedding positions. In some examples, once enough cells are processed for an IC watermark ID code to be extracted, the sub-process ends.
At step 710 the row and column index values i and j are initialized to point to the first cell in the first row (0, 0). At step 720, the cell 186 (Cij) is fetched. At step 730, the embedding position for the fetched cell 186 is determined. The embedding position is determined based on a secret key 160 (ks) and the length of the cell Cij as per equation (II). In computing the embedding position, the length of the cell Cij is decremented by 1 to obtain the original length before the invisible watermark character was inserted. At step 740 the character at the embedding position of the fetched cell 186 (Cij) is checked against the invisible character set {c1, c2, c3, c4, c5}. If the character at the embedding position does not match an invisible character from the invisible character set {c1, c2, c3, c4, c5}, then at step 745 the row index (i) is incremented and a new cell from the next row is fetched at step 720. If, at step 740, an invisible character is detected at the embedding position, then at step 750 the partition number of the cell Cij is computed. The partition number is dependent on the secret key ks, the first character of the cell Cij and the length of Cij. Again, the length is decremented by 1 to obtain the original length used to compute the partition number in the invisible character embedding sub-process 400. Once the partition number is computed, then at step 760, the invisible character found in cell Cij at the embedding position is taken as the portion of the IC watermark ID code 120-z corresponding to the partition number. At step 770, the system checks whether enough cells have been processed to determine the entire IC watermark ID code 120-z with a threshold level of certainty (e.g., each partition number has been recovered with the same character value 2 times). If so, the sub-process 700 ends at step 780. If not, then at step 775, the column index (j) is incremented and steps 720 to 770 are repeated.
In one embodiment, the sub-process 700 stops when each of the partitions of the IC watermark ID code 120-z has been extracted a number of times with a consistent value. In another embodiment, the sub-process continues until a predetermined percentage of the data has been processed. For example, if 5% of the rows in the table have been embedded with invisible characters, the sub-process 700 may continue until all 5% of the rows containing embedded cells have been processed.
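By way of illustration only, the flow of sub-process 700 can be sketched in Python as follows. Several elements of the sketch are assumptions rather than part of the described embodiment: `partition_number` and `embedding_position` are hypothetical stand-ins for equations (I) and (II), which are not reproduced in this passage; the five zero-width Unicode characters are an assumed choice for the invisible character set {c1, c2, c3, c4, c5}; and the column index is assumed to restart at the first column whenever a new row is fetched.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence

# Assumed stand-in for the invisible character set {c1, ..., c5}.
INVISIBLE_CHARS = ("\u200b", "\u200c", "\u200d", "\u2060", "\ufeff")


def key_sum(secret_key: str) -> int:
    return sum(secret_key.encode("utf-8"))


def partition_number(secret_key: str, first_char: str,
                     original_length: int, num_partitions: int) -> int:
    # Hypothetical stand-in for equation (I).
    return (key_sum(secret_key) + ord(first_char) + original_length) % num_partitions


def embedding_position(secret_key: str, original_length: int) -> int:
    # Hypothetical stand-in for equation (II). The result lies in
    # 1..original_length, so the first character of the cell is preserved.
    return 1 + key_sum(secret_key) % original_length


def has_consensus(votes: Dict[int, Counter], num_partitions: int,
                  min_votes: int) -> bool:
    # Every partition must have been recovered at least min_votes times,
    # always with the same character value.
    for part in range(num_partitions):
        counts = votes.get(part)
        if not counts or len(counts) > 1 or sum(counts.values()) < min_votes:
            return False
    return True


def extract_ic_code(table: Sequence[Sequence[str]], secret_key: str,
                    num_partitions: int, min_votes: int = 2) -> Optional[List[str]]:
    votes: Dict[int, Counter] = defaultdict(Counter)
    for row in table:
        for cell in row:
            original_length = len(cell) - 1  # undo the one inserted character
            if original_length < 1:
                break  # too short to carry a watermark character; next row
            found = cell[embedding_position(secret_key, original_length)]
            if found not in INVISIBLE_CHARS:
                break  # step 745: treat the row as unmarked and move on
            part = partition_number(secret_key, cell[0],
                                    original_length, num_partitions)
            votes[part][found] += 1
            if has_consensus(votes, num_partitions, min_votes):
                return [next(iter(votes[p])) for p in range(num_partitions)]
    return None  # not enough consistent evidence in the data provided
```

The sketch implements the first stopping rule described above (each partition recovered a fixed number of times with a consistent value). The percentage-based rule could be substituted by counting processed rows instead of calling `has_consensus`.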
Row-Noise Extraction
Referring to
At step 810, a row index (i) and a column index (j) are initialized to point to the first cell in the first row in tabular data 180 containing structured textual data. The cell Cij is fetched at step 820. At step 830, the partition number for the cell Cij is determined as explained above using equation (I) based on the secret key 160 (ks), which is provided as input to the sub-process 800, the first character of the cell Cij and the length of the cell Cij. To determine whether the cell Cij contains a portion of the digital watermark 100 (W), at step 840 the last character of the cell Cij is checked against the noise key 140 (F) character corresponding to the partition number. If the last character of Cij matches the corresponding noise key character in the NKI 140 (as determined by the partition number), then the cell Cij contains a portion (or character) of the digital NC watermark ID code 100-z. If not, then the entire row specified by row index (i) is taken not to contain row-noise characters embedded therein. In this case, at step 845, the row index (i) is incremented and control returns to step 820 to fetch a cell from the next row 182 in the table 180. Otherwise, at step 850, the embedding position is determined based on the length of Cij and the secret key 160 (ks) provided as input to the sub-process 800, as per equation (II). At step 860, the character at the embedding position is extracted as the digital watermark character corresponding to the partition number 190 of cell Cij. At step 870, a determination is made whether enough cells have been processed to determine the NC watermark ID code 100-z with enough certainty.
For example, while it is enough to extract watermarking characters from cells with enough unique partition numbers 190 to cover all of the portions of the digital watermark 100 (W), the sub-process may continue processing more cells until each partition of NC watermark ID code 100-z has been verified a number of times to ensure that the structured textual data 2000 was not tampered with. Accordingly, at step 870 if more cells need to be processed, the column index (j) is incremented to process the next cell Cij in the current row. Once a particular row is processed, the sub-process 800 increments the row index (i) to process the next row. The decision as to whether or not to process more cells may also depend on the percentage of cells processed compared to the percentage of cells expected to contain watermarking characters embedded therein.
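The row-noise extraction flow of sub-process 800 can be sketched in Python in a similar manner. The sketch rests on stated assumptions: `partition_number` and `embedding_position` are hypothetical stand-ins for equations (I) and (II); the noise key is modelled as a string holding one character per partition; and the observed cell length is passed to both stand-ins, as the passage describes. Whether a length adjustment for the inserted characters is required depends on how the embedding sub-process computed these values, which is not shown in this passage.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional, Sequence


def key_sum(secret_key: str) -> int:
    return sum(secret_key.encode("utf-8"))


def partition_number(secret_key: str, first_char: str,
                     length: int, num_partitions: int) -> int:
    # Hypothetical stand-in for equation (I).
    return (key_sum(secret_key) + ord(first_char) + length) % num_partitions


def embedding_position(secret_key: str, length: int) -> int:
    # Hypothetical stand-in for equation (II). The result lies in
    # 1..length-2, so it never selects the first or the last character.
    return 1 + key_sum(secret_key) % (length - 2)


def has_consensus(votes: Dict[int, Counter], num_partitions: int,
                  min_votes: int) -> bool:
    for part in range(num_partitions):
        counts = votes.get(part)
        if not counts or len(counts) > 1 or sum(counts.values()) < min_votes:
            return False
    return True


def extract_nc_code(table: Sequence[Sequence[str]], secret_key: str,
                    noise_key: str, min_votes: int = 2) -> Optional[str]:
    num_partitions = len(noise_key)  # assumed: one noise character per partition
    votes: Dict[int, Counter] = defaultdict(Counter)
    for row in table:
        for cell in row:
            if len(cell) < 3:
                break  # too short to hold a watermark and a noise character
            part = partition_number(secret_key, cell[0], len(cell), num_partitions)
            if cell[-1] != noise_key[part]:
                break  # step 845: no row-noise in this row, go to the next row
            pos = embedding_position(secret_key, len(cell))
            votes[part][cell[pos]] += 1  # step 860: record watermark character
            if has_consensus(votes, num_partitions, min_votes):
                return "".join(next(iter(votes[p])) for p in range(num_partitions))
    return None
```

Requiring each partition to be recovered more than once, as in this sketch, also limits the effect of an unmarked cell whose last character happens to coincide with a noise key character.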
Arbitration Engine
The recovered DWI 108-z includes: IC watermark ID code 120-z provided by character extraction sub-process 700; a first NC watermark ID code 100-z provided by the column noise extraction sub-process 600; and a second NC watermark ID code 100-z provided by the row-noise extraction sub-process 800. Although the first and second NC watermark ID codes 100-z should be identical, and all of the recovered watermark ID codes 100-z, 120-z should map back to the same authorized recipient identifier 98-z, it is possible that data corruption (either unintentional or by means of an attack) may have occurred that produces a mismatch. Accordingly, in example embodiments, arbitration engine 208 is configured to match each of these three recovered ID codes back to an authorized recipient identifier 98-z. If the watermarked data version 2000 is uncorrupted, all three recovered ID codes should map to the same authorized recipient identifier 98-z, and the corresponding identifier is output as the source of the analyzed watermarked data version 2000-z. In example embodiments, in the event that the three recovered ID codes do not all map back to the same authorized recipient identifier 98-z, then a majority vote (e.g., 2 of 3) is used to determine the authorized recipient identifier.
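The majority vote performed by the arbitration engine can be illustrated with the following minimal Python sketch. The function name `arbitrate`, the dictionary used as a lookup from ID code to recipient identifier, and the sample codes are hypothetical; in particular, the dictionary is only an assumed stand-in for whatever mapping the watermark database holds.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def arbitrate(recovered_codes: Sequence[Optional[str]],
              code_to_recipient: Dict[str, str]) -> Optional[str]:
    """Map each recovered ID code to a recipient identifier and return the
    identifier supported by a strict majority of the extraction paths."""
    recipients = []
    for code in recovered_codes:
        recipient = code_to_recipient.get(code) if code is not None else None
        if recipient is not None:
            recipients.append(recipient)
    if not recipients:
        return None
    winner, count = Counter(recipients).most_common(1)[0]
    # With three extraction paths this requires at least 2 of 3 to agree.
    return winner if 2 * count > len(recovered_codes) else None


# Hypothetical lookup: IC and NC codes issued to two recipients.
lookup = {"IC-A": "recipient-1", "NC-A": "recipient-1",
          "IC-B": "recipient-2", "NC-B": "recipient-2"}
source = arbitrate(["IC-A", "NC-B", "NC-A"], lookup)
```

In this sketch a code that fails to extract (None) or that maps to no known recipient counts against the majority, so a single surviving code is not sufficient to name a source.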
The processing system 3000 may include one or more processing devices 3002, such as a processor, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), dedicated logic circuitry, or combinations thereof. The processing system 3000 may also include one or more input/output (I/O) interfaces 3004, which may enable interfacing with one or more appropriate input devices and/or output devices (not shown).
The processing system 3000 may also include one or more storage units 3013, which may include a mass storage unit such as a solid state drive, a hard disk drive, a magnetic disk drive and/or an optical disk drive. The processing system 3000 may include one or more storage or memory units 3010, which may include a volatile memory (e.g., a random access memory (RAM)) or non-volatile memory or storage (e.g., a flash memory, read-only memory (ROM), mass storage unit such as a solid state drive, a hard disk drive, a magnetic disk drive and/or an optical disk drive). The non-transitory memory(ies) may store instructions for execution by the processing device(s) 3002, such as to carry out the methods and systems of the present disclosure. Such instructions may include instructions 101 that configure processing device 3002 and processing system 3000 to implement watermark embedding system 100, and instructions 201 that configure processing device 3002 and processing system 3000 to implement watermark extraction system 201. The non-transitory memory(ies) may store watermark database 92. The memory(ies) may include other software instructions, such as for implementing an operating system and other applications/functions. In some examples, one or more data sets and/or module(s) may be provided by an external memory (e.g., an external drive in wired or wireless communication with the processing system 3000) or may be provided by a transitory or non-transitory computer-readable medium.
There may be a bus 3014 providing communication among components of the processing system 3000, including the processing device(s) 3002, I/O interface(s) 3004, network interface(s) 3008, and memory(ies) 3010. The bus 3014 may be any suitable bus architecture including, for example, a memory bus, a peripheral bus or a video bus.
The present disclosure provides certain example algorithms and calculations for implementing examples of the disclosed methods and systems. However, the present disclosure is not bound by any particular algorithm or calculation. Although the present disclosure describes methods and processes with steps in a certain order, one or more steps of the methods and processes may be omitted or altered as appropriate. One or more steps may take place in an order other than that in which they are described, as appropriate.
Through the descriptions of the preceding embodiments, the present invention may be implemented by using hardware only, or by using software and a necessary universal hardware platform, or by a combination of hardware and software. Based on such understandings, the technical solution of the present invention may be embodied in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disk read-only memory (CD-ROM), USB flash drive, or a hard disk. The software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided in the embodiments of the present invention.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the invention as defined by the appended claims.
Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.