The field of the invention relates to a computer implemented process for embedding a digital watermark within tokenised data, and to related systems and apparatus.
A portion of the disclosure of this patent document contains material, which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
Tokenisation involves the substitution of private identifiers, such as an individual's credit card number or social security number, with a token that is generated to conform to some user-specified format and has a 1:1 relationship with the original private identifier. This same token is always used in place of the same identifier, and never used for any other identifier.
Digital watermarking relates to the process of embedding information called a digital watermark into a digital content, while preserving the functionality of the digital content.
WO2017093736A1 discloses a process of altering an original data set by combining data anonymisation and digital watermarking. In particular, the anonymisation of the original data set can be achieved using a tokenisation technique, where tokenised values are generated with a regular expression. However, the regular expression must be known at the time of extracting the watermark. Further, the tokenisation technique used includes a central vault, which can cause problems for customers who have high throughput needs, or a requirement to consistently tokenise values in remote locations.
There is a need for a process in which digital watermarking is applied to tokenised data with no knowledge of the regular expression that was used to tokenise the data. In addition, a system that scales to be able to individually watermark any number of data releases is needed.
Reference is made to WO2017093736A1, the contents of which are incorporated by reference.
An implementation of the invention is a computer implemented process for embedding a digital watermark within tokenised data, comprising the steps of:
The invention provides a scalable computer implemented process that is able to provide a large number of watermarked data releases of private data that has been tokenised.
The tokenisation used is a vaultless tokenisation, in which tokens are generated without requiring a token database or vault. This can benefit solutions where data releases need to be generated with high throughput, and with a requirement to consistently tokenise values. By providing a solution that uses deterministic tokenization, the process can also achieve lower latency.
By combining digital watermarking with deterministic tokenisation, watermarked tokens can also be efficiently shared around the globe without the raw data being sent out in the clear. In comparison, vault-based tokenisation distributed around the globe requires the raw data to be sent to the token vault alongside the tokens, because both have to be stored in a centralised vault. But this sending of raw identifiers from one jurisdiction to another is often contrary to legal directives or regulations.
Aspects of the invention will now be described, by way of example(s), with reference to the following Figures, which each show features of the invention:
An implementation of the invention proposes a computer implemented process of incorporating digital watermarking on top of deterministic tokenization.
We may refer to the following terms throughout the description.
Input space—a space of all possible inputs from a set of original data that might need to be tokenised. The input space may be described using a regular expression. For example, when tokenising credit card numbers, a simple input space definition might be “[0-9]{16}”—16 decimal digits (this example ignores the complication that not all prefixes are valid, and the Luhn digit check, etc).
Token space—similar to the above, this is the space of all possible tokens that can be returned.
Tokenised data—data where the input values have been replaced with tokens.
Data release—generally refers to any release of tokenised data to a particular recipient for a particular purpose. Each data release is therefore associated with its own digital watermark. The digital watermark may be a number or other ‘ID’ which is stored in a watermark registry alongside metadata. Hence by extracting the watermark from the release, any metadata associated with the data release may also be obtained. Metadata may include, for example, the one or more recipients allowed to receive the data release, the purpose or intended use of the data release, how long the one or more recipients are legally allowed to retain the data, and with whom they are allowed to share the data.
Watermark tokens—as will be apparent in the following description, the hash-based watermarking scheme used works by not returning any tokens that hash to a value that falls within the watermark bin—these tokens are referred to as the watermark tokens. The token space therefore consists of the watermark tokens (those that hash to the watermark bin) and the non-watermark tokens (those that hash to other bins).
Watermark inputs—with deterministic tokenization, a 1:1 mapping from all inputs to all tokens has to be fixed. Some of these inputs will therefore be mapped to the watermark tokens, but we don't want to return these tokens. The crux of the scheme therefore is that we influence the mapping so that the watermark tokens are mapped to inputs that we don't think are likely to occur—and we call those inputs the watermark inputs. As an example, if we were tokenising credit card numbers using the regular expression described above, then we might choose our watermark inputs to be those numbers that start with 0000, since these are not used for real credit card numbers so we won't encounter them. As described below, the mapping of inputs to tokens is performed on the fly by the algorithm.
A previous watermarking technique, as disclosed in WO2017093736A1, allows the generation of these tokens to be controlled so that a pattern is embedded within them.
This pattern can be varied for each data release and allows a unique identifier for the release to be embedded within and across the data itself. This identifier can be used as a pointer to an arbitrary store of metadata about the data release—the intended recipient and purpose of the release, its lineage including the privacy treatments that have been applied to it, the date by which the data must be deleted, etc. This embedded pattern is probabilistic and is extractable from a sample of the generated tokens rather than being reliant on any individual tokens, so that it is still extractable from a sufficiently large subset of a data release.
A rejection sampling based algorithm works by rejecting potential tokens (and instead generating another token) according to some pattern, and then a corpus of watermarked data is scanned to reconstruct the pattern and thus learn the watermark. The pattern embedded by the algorithm is based on the hash of the tokens—the hash space is divided into bins, each of which is assignable to a data release (with each data release being assigned a different slice of the hash space); when watermarking a data release we reject any token that hashes to a value falling within the current data release's hash bin. If we then scan the watermarked data, hashing the tokens and building a histogram of token counts within each bin, the bin left empty identifies the data release, as shown in
Vault Based Vs. Deterministic (Vaultless) Tokenisation
The process described above works with vault based tokenisation because we get to choose the token to assign to an input at the point of tokenising that input. As long as we encounter fewer inputs than there are possible tokens we don't have to return every token, and we can choose to never return those tokens that hash to the watermark bin.
Rejection sampling has therefore been achieved with a tokenisation system that generates tokens matching the required format randomly, storing the generated token in a persistent data store (the “token vault”). At the point of generating a token, if a candidate value is generated that should be rejected then another can simply be generated. However, this reliance on a central token vault can cause problems for customers who have high throughput needs, or a requirement to consistently tokenise values in remote locations.
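For illustration, a minimal sketch of this vault-style rejection loop is shown below, assuming a keyed hash (HMAC-SHA256 here) reduced to one of B bins; the key, bin count and helper names are illustrative rather than taken from any actual implementation.

```python
import hashlib
import hmac
import secrets
import string

B = 1024                                      # number of hash bins (one assignable per data release)
WATERMARK_KEY = b"illustrative watermarking key"

def bin_of(token: str) -> int:
    """Keyed hash of a token, reduced to a bin index in the hash space."""
    digest = hmac.new(WATERMARK_KEY, token.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % B

def random_candidate(length: int = 16) -> str:
    """Stand-in for a vault-based tokeniser drawing a fresh random candidate token."""
    return "".join(secrets.choice(string.digits) for _ in range(length))

def tokenise_with_watermark(release_bin: int) -> str:
    """Rejection sampling: keep drawing candidates until one falls outside the
    data release's watermark bin, so that bin is never populated."""
    while True:
        candidate = random_candidate()
        if bin_of(candidate) != release_bin:
            return candidate
```

Scanning the released tokens with the same keyed hash and counting per bin then reveals which bin was kept empty.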
Because of this, it may be preferred to use a tokenisation scheme that is algorithmic (e.g. based on a format preserving encryption cipher) and thus entirely deterministic. Under such a scheme, for every possible (i.e. matching the defined format) input there is a mapped token—it is therefore assumed that it is not possible to perform rejection sampling watermarking with this system: if an input that is mapped to a token that should be rejected is encountered, there is no option but to release that token. This is because it is not possible to choose another token—that other token is 1:1 mapped to a different input, and choosing it would ultimately break the tokenisation scheme. A solution to this problem is presented in the following paragraphs.
An implementation of the invention is a method for making rejection based watermarking work on top of deterministic tokenisation. It uses the observation that the probability of encountering a particular input is often not uniform across the input space, and endeavours to assign those tokens that should be rejected to those inputs that are least likely to be encountered. This is achieved using a combination of two format preserving encryption ciphers—one that maps the least commonly encountered inputs to the ‘rejected’ tokens or watermark tokens, and the other that maps the remaining (majority) of inputs to the other tokens or non-watermark tokens.
Advantageously, the digital watermark is embedded within the set of generated tokens and not in any metadata or redundant data.
With deterministic tokenisation, as the name implies, we have a predetermined mapping of which token will be returned for each input before tokenisation starts. There is a single regular expression defined that represents both the tokens that will be returned and the inputs that will be encountered, and this mapping is defined in terms of this expression—for example, it may be that, with the expression [A-Z], input ordinal 5 (i.e. E) maps to token ordinal 10 (i.e. J).
At the point of tokenising an input, we have no flexibility to choose any token other than the mapped one, regardless of whether or not it falls within the hash bin that we would like to remain empty.
To embed a perfect watermark we need to avoid ever returning a token that hashes to a value that falls within the slice of the hash space that has been nominated as the bin assigned to the data release (hereafter referred to as a watermark token). But the watermarking extraction algorithm is tolerant to the addition of some level of random noise. Such noise decreases the level of confidence reported for the watermark match, which has the effect of requiring more tokens to be scanned before reaching the required confidence, but does not prevent successful extraction (the relationship between noise and the number of tokens required to reach a confidence threshold is well understood).
Therefore, to embed a watermark we need to meet two criteria:
When using a token vault these criteria are met—we organically discover which inputs exist in the data (we are never asked to tokenise an input that never appears) and we are able to choose which token to assign to an input at the point of tokenisation. To make watermarking work with deterministic tokenisation, we need to engineer a way for these criteria to hold there too.
Vaultless tokenization has several advantages, such as in distributed deployments where it is often not possible to call out to a centralised vault. In comparison, vault-based tokenisation distributed around the globe requires the raw data to be sent to a token vault alongside the tokens, because both have to be stored in a centralised vault. But this sending of raw identifiers from one jurisdiction to another is often contrary to legal directives or regulations.
We wish to map these inputs (which we refer to as the watermark inputs) to those tokens that hash to a value that falls within the data release watermark bin (the watermark tokens). A cryptographic hash function (based on a secret ‘watermarking key’) used within the watermarking algorithm will distribute hashes uniformly across the hash space. This implies that the hashes falling within a range of the hash space are from values drawn uniformly from across the value space—i.e. that the watermark tokens will be uniformly distributed across the token space, as shown in
However, the mapping from an input to its token is determined by the underlying format preserving encryption cipher and is a permutation that is indistinguishable from random—it is not possible to hard-wire mappings into the cipher (at least, not without devising one's own non-standard and inevitably insecure encryption cipher). To achieve a scheme where a subset of the input space maps to a subset of the token space, we have to treat the subspaces as distinct spaces with their own separate encryption cipher. To encrypt a value, we follow these steps:
To implement this solution, we need to solve two problems:
Hence each format preserving encryption cipher uses a secret key. A further secret key—the watermarking key—is used by the cryptographic hash function in order to prevent an attacker learning whether a token is a watermark token or not.
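To make the two-cipher structure concrete, the following sketch uses a keyed shuffle as a toy stand-in for a real format preserving encryption cipher (a real system would use an FPE cipher such as FF1); the subspace sizes are taken from the worked example later in this description, and all names and keys are illustrative.

```python
import random

def toy_fpe_permutation(size: int, key: int) -> list:
    """Toy stand-in for a format preserving encryption cipher over indices
    0..size-1: a keyed, deterministic permutation. A real system would use an
    FPE cipher such as FF1 instead of a shuffle."""
    permutation = list(range(size))
    random.Random(key).shuffle(permutation)
    return permutation

NON_WATERMARK_INPUTS = 347    # inputs we expect to encounter (sizes from the worked example)
WATERMARK_INPUTS = 53         # inputs we expect to see rarely or never

cipher_main = toy_fpe_permutation(NON_WATERMARK_INPUTS, key=1)   # non-watermark subspace
cipher_rare = toy_fpe_permutation(WATERMARK_INPUTS, key=2)       # watermark subspace

def encrypt_index(is_watermark_input: bool, index: int) -> int:
    """Encrypt an index within its own subspace; the two subspaces never mix."""
    return cipher_rare[index] if is_watermark_input else cipher_main[index]
```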
To be able to exploit a distribution in the values appearing within input data sets, we first need to know what this distribution is—armed with this information, we know which inputs appear either rarely or never, and can endeavour to assign the watermark tokens to those inputs. Examples of approaches to this are now described:
In some situations, there exists some a priori knowledge of the structure of the input data that can be used to describe regions of the input space that will never be encountered. For example, not all numbers that conform to the Social Security number structure are actual Social Security numbers because numbers with 666 or 900-999 in the first digit group are never allocated.
Making Assumptions about the Data
If the data has some external meaning—examples include names, email addresses, salaries—then it is likely that it may fit some general heuristics about data that we can use to make an educated guess about where to allocate the watermark tokens. For example, in English language text the digraph “th” is likely to be much more frequent than “qz”, and for many numeric distributions the probability density function is often low at the upper end of the range.
We can use this to allocate the watermark tokens—for example, if presented with a regular expression of [A-Z][a-z]{1,9} we could define a range such as (Qz|Jq|Jx|Qx)[a-z]{0,8} to capture those inputs that we expect to appear least frequently (or never).
Benford's law states that in many numeric data sets the leading digit is likely to be small, and the probability of a particular digit being the leading digit decreases logarithmically as the digit increases. Benford's law generally applies to datasets that have a lognormal distribution and so its direct usage is probably not general enough, but we can generalise to say that for numeric data that falls within some defined range, the probability density function is often low at the upper end of the range. This holds for the lognormal distributions spanning several orders of magnitude that satisfy Benford's law, as well as for normal distributions (e.g. height), distributions with long tails (e.g. salary), and monotonically increasing values (e.g. identifiers drawn from database sequences). Therefore, simply allocating the watermark tokens to the upper end of the numeric range can be expected to give good results for a lot of data sets.
If nothing is known of the input distribution in advance, and if the assumptions in the previous section do not hold, then it may be possible to infer the input distribution by scanning the data and observing it. Note however that this may be a weaker solution than the previous techniques because the configuration has to be finalised from a scan of the current data, but there are no guarantees that future data will have the same distribution. For example, if the data was drawn randomly from within a space with uniform probability, then we may be able to find regions within the space with no values to assign watermark tokens to, but future values may fall within these ranges (whereas Social Security numbers will never contain the special numbers, and data that follows Benford's law will continue to do so in the future).
The watermark tokens are defined as being those tokens that hash to a value that falls within the watermark bin. Since a hash function is a one-way function, there is no way to be able to take the watermark bin hash values and find the tokens that will hash to them. The only way to find if a token's hash falls within the bin is to hash it and find out, and attempting to brute force the entire token space to find all of the watermark tokens is infeasible for all but the most modest token spaces. Instead, we use the fact that the watermark tokens will be uniformly distributed across the token space. This means that if we divide the token space into segments then on average these segments will contain equal numbers of watermark tokens—for example, if we divide the space into as many segments as there are watermark tokens then on average each segment will contain a single watermark token (some segments may not contain any, and some will contain multiple, but on average each segment will contain a single watermark token). As an example,
Since we have divided the input space into those inputs that we want to map to normal tokens and those that we want to map into watermark tokens, we can define a segment size that gives us as many segments as there are watermark inputs. We then define that each segment contains exactly one nominated token that we will assign to a watermark input. We would like to nominate a true watermark token (i.e. a token that hashes to the watermark bin), but we do not know in advance which token within the segment this is (nor even whether the segment does in fact contain a watermark token). When we need to choose the nominated token, we search the segment to find the first token within it that is a true watermark token (falling back to returning the last token within the segment if none is found).
When we need to choose the nominated token, we search the segment using the following process:
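A minimal sketch of this segment search, consistent with the description above, is shown below; the `is_watermark_token` helper (a keyed-hash membership test for the release's watermark bin) and the handling of the starting offset are assumptions made for illustration.

```python
def find_nominated_token(segment_tokens: list, is_watermark_token, start_index: int = 0):
    """Seek forward from the starting offset, returning the first token whose
    keyed hash falls in the watermark bin; if none is found by the end of the
    segment, fall back to the last token in the segment."""
    for token in segment_tokens[start_index:]:
        if is_watermark_token(token):
            return token
    return segment_tokens[-1]
```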
Note that there may be two ways in which this system may lead to imperfect watermark input→watermark token and non-watermark input→non-watermark token mappings:
As we can see, the first outcome weakens the embedded watermark, and is more likely to occur when we declare too few watermark inputs. The number of watermark inputs we declare drives the number of nominated tokens we consider, but this is independent of the number of watermark tokens that actually exist. The number of watermark tokens depends on the number of hash bins and the token space size. If the number of watermark inputs is less than the number of watermark tokens, then it is clear that this scenario will occur.
The second outcome is benign, unless its occurrence also implies the occurrence of the first outcome—for example, if the declared number of watermark inputs exactly matches the actual number of watermark tokens then it is likely that the segmentation will give imperfect results, and the fact that some segments produce outcome two means that others must produce outcome one.
The optimal strategy is achieved when the number of watermark inputs exceeds the number of watermark tokens by a comfortable margin such that the segments are small enough that the probability of a segment containing multiple watermark tokens is small. But this should not be achieved by artificially inflating the number of watermark inputs since this may lead to assigning watermark tokens to inputs that are not really encountered more rarely than other inputs (and thus introducing noise to the watermark through the frequent release of watermark tokens).
If an attacker was able to determine that a particular token is a watermark token, then she would know that the corresponding input is drawn from within the space of less frequently encountered inputs—an unacceptable leak of information. However, because the algorithm uses a secret watermarking key when embedding the watermark (using a cryptographic hash function) this kind of inference is impossible without access to this key and an attacker can learn nothing more from the tokens when compared to a vault based solution.
A worked example is now described to illustrate the complete algorithm.
This section works through the steps of the complete algorithm for two scenarios simultaneously, an input that falls in the watermark inputs region and one that does not. We use the same example scenario as discussed above.
Step 1: Determine the Index of the Value within the Input Space
For each input, we first determine which subspace it is contained within and then its index within that space.
In our example, the input 75 (input with ordinal 123) falls within the non-watermark inputs subspace where it has index 117, and the input 76 (input with ordinal 356) falls within the watermark inputs subspace where it has index 9.
Step 2: Encrypt the Index within its Subspace
We now encrypt the input within its subspace—taking the index within this space and using a format preserving encryption method to obtain another index within the same space.
In our scenario, the input 75 has index 117 within a subspace of size 347—in this example, this encrypts to index 226. The input 76 has index 9 within a subspace of size 53, which encrypts to index 23.
Step 3: Find the Token Space Segment that these Inputs Map to
We now need to find the token ordinal that these subspace indices map to. By definition, we have one nominated token per segment—since there are a total of 53 watermark inputs, we define 53 segments. We try to balance the size of these segments as far as possible, so we create 29 segments of size 8 and 24 of size 7.
Since there is one nominated token per segment, the nominated token with index 23 is in the 23rd segment. To find the non-watermark token with index 226 (the 226th non-watermark token in the space), we need to skip over segments until we have passed 225 other non-watermark tokens. The first 29 segments each contain 8 tokens, one of which is a nominated token, so once we have skipped all of these we have passed 203 non-watermark tokens. The remaining segments contain 7 tokens (6 of which are non-watermark tokens) and so we need to skip a further 3 of these to bring us to a total of 221 non-watermark tokens. Therefore we can say that the non-watermark token with index 226 will be the 5th non-watermark token in the 33rd segment.
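The segment arithmetic in this step can be reproduced with a short calculation; the helper below assumes the 1-based indexing used in the example and the 29 + 24 segment split described above.

```python
def locate_non_watermark_token(index: int, segments_of_8: int = 29, segments_of_7: int = 24):
    """Find which segment holds the index-th (1-based) non-watermark token, given
    that every segment contains exactly one nominated (watermark) slot."""
    remaining = index
    segment = 0
    for size in [8] * segments_of_8 + [7] * segments_of_7:
        segment += 1
        non_watermark_in_segment = size - 1        # one slot per segment is nominated
        if remaining <= non_watermark_in_segment:
            return segment, remaining               # (segment number, position within it)
        remaining -= non_watermark_in_segment
    raise ValueError("index is outside the token space")

print(locate_non_watermark_token(226))   # (33, 5): the 5th non-watermark token of segment 33
```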
Step 4: Find the Relevant Token within the Segment
Having found the correct segment for each case, we now need to find the relevant token within it.
For the input 112, we need to find the nominated token within segment 23. Our starting point in this segment is index 3, and we have to seek forward testing each token until we discover that the 7th token in the segment (the 5th token we test) is a watermark token and hence the nominated token.
All that remains now is to map the segment tokens that we found to their ordinal within the entire token space, as shown in
We now finally see that the input 111 with ordinal 123 tokenises to the token with ordinal 256, and the input 112 with ordinal 356 tokenises to the token with ordinal 184.
Although the watermarking scheme has been described in combination with vaultless tokenization, it may also be extended to vault-based tokenization. With a vault scheme, watermarking is typically an easier problem to solve as the system can be configured such that watermarked tokens are not outputted when specific inputs are encountered. However, a vault scheme stops working once more inputs are seen than there are non-watermark tokens, because in that case watermark tokens would have to be returned. The process described here also provides a solution that avoids this problem when combining watermarking with vault-based tokenization.
Further details on the algorithm are now provided.
To enable the watermark to be extracted from just the token (with no knowledge of how it was generated), the pattern is embedded using the hash of the tokens. Note that, although this document discusses the process in terms of “tokens”—understood to be the output of a consistent tokenisation operation—the same watermarking methodology would apply to any process that produces output containing some pseudorandomness.
For example, the output of the blurring of numeric values could be hashed and subjected to the same process.
The hash space will be divided into equal width bins, and each bin will be assignable to a different data release (and therefore the number of data releases that can be watermarked is equal to the number of bins). The diagram shown in
To embed the watermark for a data release, we reject any tokens that hash to a value that falls within that data release's watermark bin. The fraction of tokens rejected in this scheme is therefore 1/N, where N is the number of data release watermark bins.
The process to extract the watermark is illustrated in
The downside of this algorithm is that as the number of distinct data releases (bins) increases, so does the number of records needed to extract the watermark. To find the watermark, we need all of the other N−1 bins to contain at least one value. This scales badly with the number of data releases.
Inspired by Bloom Filters, we can use multiple hash functions and have them work together in a Hash Array structure. To embed the watermark, we reject any token that falls within the watermark bin for any of the hash functions (see below for an alternative method of using this configuration that was tested but ultimately rejected). Then when extracting the watermark, we find the single bin index that is empty in every one of the array's hash function bins—we find the empty bins for each hash function, and take the intersection of these sets, as shown in
Note that this gives us a higher rate of token rejection (when compared with the same number of bins and a single hash function) as there are now multiple chances for a token to be rejected.
The Hash Array structure allows us to tune the number of bins and the number of hash functions to balance the number of supported data releases and the token rejection rate.
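A minimal sketch of the Hash Array idea is shown below, deriving the array's hash functions by varying an HMAC key with an index; this construction and the parameter values are illustrative only, not the exact scheme used.

```python
import hashlib
import hmac

B = 1024                          # bins per hash function
H = 10                            # number of hash functions in the array
KEY = b"illustrative hash array base key"

def bins_of(token: str) -> list:
    """One bin index per hash function for this token (key varied by index)."""
    return [
        int.from_bytes(
            hmac.new(KEY + bytes([i]), token.encode(), hashlib.sha256).digest()[:8],
            "big",
        ) % B
        for i in range(H)
    ]

def is_rejected(token: str, release_bin: int) -> bool:
    """Embedding: reject a token if ANY of its hashes lands in the release's bin."""
    return release_bin in bins_of(token)

def extract_candidate_bins(tokens) -> set:
    """Extraction: the watermark bin is the index that is empty in EVERY histogram."""
    counts = [[0] * B for _ in range(H)]
    for token in tokens:
        for i, b in enumerate(bins_of(token)):
            counts[i][b] += 1
    empty = [{b for b in range(B) if counts[i][b] == 0} for i in range(H)]
    return set.intersection(*empty)
```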
However, the configuration must be decided up-front, which forces us to decide how many data releases we wish to support in advance and creates a finite pool of watermarks. But it is possible to dynamically scale the number of watermarked data releases by creating new instances of the Hash Array structure (each with its own hash function keys) and assigning different tranches of data releases to each. We call this structure a Multi Hash Array, and it has the following properties:
When extracting the watermark, we perform a set of parallel Hash Array extractions—one per commissioned instance. As mentioned, since each instance has its own hash function keys, the watermark pattern will appear in a single instance only, with the other instances just observing a uniform distribution of hashes (which appears as random noise). This is illustrated in
The partitioning of data releases that this structure imposes may also provide additional functional benefits:
Each set of individual Hash Array keys is generated using a scheme like HKDF (a simple key derivation function based on the HMAC message authentication code) that allows expansion of a single master key into many different derived keys (and the ability to efficiently obtain a specific key by providing the ‘ID’ of the key in the input key material). However it would be possible to have finer grained control of keys (a different master key per tranche of Hash Arrays, or per individual Hash Array), which might also have a couple of security benefits:
The number of unique watermarks that can be embedded and the fraction of tokens rejected (the token rejection rate) depend on the configuration parameters of the algorithm, which are:
Since one bin is used for each data release, it is clear that the number of supported data releases is given by:
To embed the watermark for a data release, we reject any tokens that hash to a value that falls within that data release's watermark bin for any of the hash functions. Thus, the token rejection rate is given by:
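The two expressions referenced above are not reproduced here. A plausible reconstruction, writing b for the number of bins per hash function, h for the number of hash functions and m for the number of Hash Array instances, and consistent with the rejection rates quoted later (just over 0.97% for b = 1024, h = 10), is:

```latex
\text{supported data releases} = m \times b,
\qquad
\text{token rejection rate} = 1 - \left(1 - \frac{1}{b}\right)^{h}
```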
The data set that we are attempting to extract a watermark from may not be a clean collection of tokens with no watermark tokens: it may have been doctored through the addition of new synthetic rows; it may be a combination of outputs from several data releases; or it may be that the assumptions made about the data shape when assigning the watermark inputs were not perfectly correct. Since the watermark is embedded using a secret key, it is not possible to craft noise that will be overrepresented in any particular bin without access to this key (either directly or indirectly through the watermark extraction function), which we assume is not available to anyone trying to erase a watermark.
Therefore the addition of synthetic rows will manifest as a baseline level of noise on top of the pure watermark, and the combination of multiple data release watermarks will manifest as several bins each some fraction below the baseline level.
We therefore have the following requirements for our extraction algorithm:
Any approach that attempts to find empty bins would have to loosen the definition of “empty” to handle noise. It could only do this by defining some threshold fraction or count below which a bin is considered empty and above which it is not, which introduces an element of arbitrariness into the algorithm that is unsatisfactory.
To avoid this, the extraction method does not try to determine “which data release bins are empty?”. Instead, it reframes the question to ask, for a given data release, “are we sufficiently confident that the data contains a watermark for this data release bin?”, which it answers by calculating how likely it is that we would observe the current number of hashes in the data release bin if the watermark was not present and we were just observing noise in the bin.
By asking this question of every data release bin, we can return the data release (or data releases) that we are confident were the sources of the watermarked data (because it is sufficiently unlikely that the observed number of hashes in their bins could be down to random noise).
The watermark extraction algorithm is a simple hypothesis test at every bin, which determines whether there is sufficient evidence to reject the null hypothesis (the data does not contain a watermark for the current bin) in favour of the alternative hypothesis (the data does contain a watermark for the current bin). It does this by computing the probability of getting the observed number of hashes or lower in the current bin if the data did not contain a watermark for the bin data release. If this probability is lower than the significance level implied by the user-provided confidence level then we reject the null hypothesis that a watermark corresponding to the data release bin is not present and instead declare the presence of such a watermark.
With multiple hash functions we have multiple instances of the hash array structure. When considering a watermark bin, we sum the hashes for the bin (and the total number of hashes) across all of the hash functions.
If the null hypothesis were true, we would expect hashes to fall into a bin as often as into any other bin. However, if a watermark for the bin was present, we would expect to observe a much lower fraction of hashes in that bin compared to the other bins.
The p-value is defined as the probability of obtaining results at least as extreme as the observed results when the null hypothesis is true. In our case, this is the probability of getting the observed number of hashes (or fewer) in the bin when the data doesn't contain a watermark corresponding to the bin.
To compute the p-value, we model the hashing of tokens into different bins as a binomial distribution, where the probability of a token hashing into a given bin is 1/b when there is no watermark.
Note that when the data has a watermark that corresponds to a different bin, then this probability will be greater than 1/b. However, in those cases the actual p-value will always be less than the p-value computed with 1/b, and so we can always safely reject the null hypothesis if the computed p-value is less than alpha.
Therefore the probability of seeing k hashes in a bin after observing n hashes overall is given by:
Hence, the p-value for a bin containing k hashes after observing n hashes overall is given by:
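These two expressions are not reproduced here; reconstructed from the binomial model just described (n hashes observed, each landing in a given bin with probability 1/b under the null hypothesis, and writing X for the number of hashes in the bin), they would take the form:

```latex
P(X = k) = \binom{n}{k} \left(\frac{1}{b}\right)^{k} \left(1 - \frac{1}{b}\right)^{n-k},
\qquad
p\text{-value} = P(X \le k) = \sum_{i=0}^{k} \binom{n}{i} \left(\frac{1}{b}\right)^{i} \left(1 - \frac{1}{b}\right)^{n-i}
```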
The smaller this p-value is, the more evidence we have for the presence of the watermark.
The extraction algorithm takes a confidence level as user input. This is interpreted as 1 − α, where α is the statistical significance of the test—that is, a bound on the probability that we will wrongly declare the presence of a watermark in data where no such watermark exists (and so the significance level provides a bound on the false discovery rate).
If the computed p-value for a bin is lower than the significance level α, then we reject the null hypothesis and declare the presence of a watermark for the bin. If it is not, we fail to reject the null hypothesis and declare that no watermark for the bin is found in the dataset.
Caution: This confidence level shouldn't be interpreted as the probability of a watermark actually being present when we reject a null hypothesis.
When extracting an unknown watermark from data we need to test every data release bin, and for the process to be successful the decisions from all of these tests must be correct. Testing multiple bins at the same time for the presence of a watermark increases the false positive rate of the overall test beyond the false positive rate of a single test for one bin. To address this issue, we use the Holm-Bonferroni Method, which ensures that the overall error of a family of tests stays below the required error limit (whilst ensuring a higher statistical power than the standard Bonferroni correction which, for our context, means an increased ability to detect the presence of multiple watermarks simultaneously mixed into a dataset). It does this by reducing the value of α used for each test by a factor of the number of tests to be performed, which is the number of data release bins to test.
The process is to first sort the bins by p-value (lowest first) and then to use a different significance level (α) for each bin. The bin with the lowest p-value is tested first at the significance level of α/b. If the p-value for this data release is less than the required significance level, then the bin with the next lowest p-value is tested, this time using a significance level of α/(b−1). This process continues until we encounter a bin whose p-value is greater than its corresponding significance level. Hence, the values of α for extracting a combination of data release watermarks will be:
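A minimal sketch of this step-down procedure, assuming the per-bin p-values have already been computed from the binomial model above, is:

```python
def holm_bonferroni(p_values_by_bin: dict, alpha: float) -> list:
    """Return the bins declared to carry a watermark under the Holm-Bonferroni
    step-down procedure: test the smallest p-value at alpha/b, the next at
    alpha/(b-1), and so on, stopping at the first failure."""
    detected = []
    ordered = sorted(p_values_by_bin.items(), key=lambda item: item[1])  # lowest p first
    remaining = len(ordered)                    # number of tests still to perform
    for bin_index, p_value in ordered:
        if p_value <= alpha / remaining:
            detected.append(bin_index)
            remaining -= 1
        else:
            break                               # all later (larger) p-values also fail
    return detected
```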
When there is a watermark with no noise added, we have an empty bin. Hence, the p-value is just the probability of all n tokens hashing to a bin other than the current bin:
To detect the watermark, the computed p-value must be less than or equal to the (Holm-Bonferroni corrected) significance level, thus the minimum number of tokens is the point where:
Rearranging gives the equation for computing the number of tokens required for extracting a watermark with no noise:
The only difference to the above derivations when there are multiple Hash Array instances is that there are now m×b possible watermarks, rather than just b. Repeating the steps above with the new number of tests gives the number of required tokens as:
Finally, the number of required tokens is inversely proportional to the number of hashes, and hence:
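The expressions referred to in the preceding paragraphs are not reproduced here. A reconstruction that follows from the stated reasoning (and that reproduces the 1017-token figure quoted later for b = 1024, h = 10, m = 1 at 95% confidence) is:

```latex
\left(1 - \frac{1}{b}\right)^{n_{\text{hashes}}} \le \frac{\alpha}{m\,b}
\quad\Longrightarrow\quad
n_{\text{hashes}} \ge \frac{\ln\left(\alpha / (m\,b)\right)}{\ln\left(1 - \frac{1}{b}\right)},
\qquad
n_{\text{tokens}} = \frac{n_{\text{hashes}}}{h} \ge \frac{\ln\left(\alpha / (m\,b)\right)}{h \,\ln\left(1 - \frac{1}{b}\right)}
```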
When computing the number of tokens required to extract a watermark, we need a bound on the joint probability of:
For the noiseless case we know that the probability of A is zero (there will never be any hashes in the watermark bin if there is no noise) and so we only had to compute the probability of seeing zero hashes in a bin when the corresponding watermark is absent. But when there is noise, we need to consider both these probabilities.
We use the same notations as earlier, with the following new definitions:
The expected fraction of hashes in the watermark bin, when the fraction of noise is f is given by:
We know that the joint probability of two events will always be less than or equal to the sum of the probabilities of the individual events. Hence, finding the number of tokens required to read back the watermark at a given noise level is equivalent to solving for an n such that there exists a k that satisfies the below inequality:
Getting an explicit expression for n that satisfies the above expression may be challenging, and so an estimation function instead may perform a brute force search over n and k to find the number of tokens that satisfies the above inequality.
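A sketch of such a brute-force estimator is shown below, using SciPy's binomial distribution. The exact bound used is not given in this text, so the model here—watermark-bin counts treated as Binomial(n·h, f/b) under noise fraction f, unwatermarked bins as Binomial(n·h, 1/b), and their probabilities summed against a corrected significance level—is an assumption for illustration.

```python
from scipy.stats import binom

def tokens_required(b: int, h: int, m: int, noise: float, confidence: float,
                    n_max: int = 100_000, step: int = 50) -> int:
    """Brute force the smallest number of unique tokens n for which some
    threshold k separates the watermark bin (expected fraction noise/b of the
    hashes) from an unwatermarked bin (expected fraction 1/b)."""
    alpha = (1.0 - confidence) / (m * b)             # corrected significance level
    for n_tokens in range(step, n_max + 1, step):
        n_hashes = n_tokens * h                      # each token contributes h hashes
        k_cap = int(2 * n_hashes / b) + 10           # counts far above the mean are hopeless
        for k in range(k_cap + 1):
            p_watermark_bin_high = binom.sf(k, n_hashes, noise / b)   # bin fails to look empty
            p_clean_bin_low = binom.cdf(k, n_hashes, 1.0 / b)         # a clean bin looks empty
            if p_watermark_bin_high + p_clean_bin_low <= alpha:
                return n_tokens
    raise ValueError("no solution found within n_max tokens")
```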
Rows Vs. Unique Tokens
The modelling above calculates the number of tokens required to extract the watermark, but this is really the number of unique tokens as there is an implicit assumption that each token that is hashed and added to the array bins is giving us a new piece of information about the token distribution (and hence the watermark pattern embedded within it). It is only possible to embed a watermark if a sufficient diversity of inputs is encountered to allow us to return a range of tokens that touch all bins within the hash space—in the pathological case where we only ever encounter a single input, we would only ever return tokens that would populate a single bin.
It is therefore important to endeavour to add each encountered token to the extraction hash bins only once. This requires the extraction process to keep track of the tokens it has previously encountered, but it is easy to do this using a Bloom filter. Since we know the number of unique tokens that we will be required to add before we expect to be able to extract the watermark, we can size the filter appropriately—choosing a limit of 100,000 values (far in excess of the number of unique tokens we would ever expect to require) and a false positive rate of 10⁻⁹ gives a filter that requires only about 525 KB of memory (and note that the extraction process only requires a single instance of this filter regardless of the number of data releases or the algorithm configuration parameters).
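That memory figure can be checked against the standard Bloom filter sizing formula; a quick calculation under the stated parameters (100,000 values, 10⁻⁹ false positive rate):

```python
import math

n_values, fp_rate = 100_000, 1e-9
bits = math.ceil(-n_values * math.log(fp_rate) / (math.log(2) ** 2))  # optimal filter size in bits
hash_count = round((bits / n_values) * math.log(2))                   # optimal number of hash functions
print(bits // 8 // 1024, "KiB,", hash_count, "hash functions")        # ~526 KiB, ~30 hashes
```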
The strength of an embedded watermark may be reported to the user as part of a tokenisation job. This strength can be interpreted as the maximum confidence level that an extraction can be performed at and still correctly obtain the watermarked data release (assuming the output file is not doctored in any way), and provides an easily comprehensible summary of whether the processed data contains sufficient unique tokens to carry a watermark.
As shown previously, the token rejection rate is independent of m, and the mean number of tokens needed to extract the watermark grows logarithmically with m. It therefore makes sense to treat m as a dynamic parameter—start with just a single Hash Array instance, and add more as and when additional data release watermarks are required. In this way, the token rejection rate will remain constant and the number of tokens required to extract a watermark is always at (about) the lowest value possible for the required number of data releases, growing only as new data release watermarks are released.
As we've seen, the choices of b and h affect all aspects of the algorithm—the number of data releases that are supported, the token rejection rate, and the number of tokens required to extract the watermark:
However, it can be shown that these quantities are inherently related regardless of the choice of b and h by starting with the token rejection rate equation and rearranging:
Substituting this relation, and the equation for the number of data releases, into the equation for number of tokens gives the fundamental relationship between the quantities:
For a given extraction confidence level, this relationship tells us that the number of tokens required to extract the watermark is a function only of the token rejection rate and the number of supported data releases and is independent of the configuration of the multi hash array.
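The rearrangement referred to above is not reproduced here; a plausible reconstruction, writing r for the token rejection rate and D = m·b for the number of supported data releases, is:

```latex
r = 1 - \left(1 - \frac{1}{b}\right)^{h}
\;\Longrightarrow\;
h \ln\left(1 - \frac{1}{b}\right) = \ln(1 - r),
\qquad
n_{\text{tokens}} \ge \frac{\ln\left(\alpha / D\right)}{\ln(1 - r)}
```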
Since the number of tokens required is reduced by rejecting more tokens, and the acceptable token rejection rate depends on the usage scenario, we may require multiple different configurations, one for each usage scenario. For our purposes we consider two common usage scenarios—the tokenisation of a bulk dataset and the tokenisation of a much smaller set of results from an interactive database query.
For a traditional use case of tokenising a bulk dataset, the acceptable token rejection rate is capped at 1%. The optimal configuration is therefore reached by choosing parameters that:
Therefore the proposed configuration for this scenario is:
This gives a token rejection rate of just over 0.97%.
For a use case of obtaining a small sample of a dataset, either as a preview of the dataset or as the result of a selective SQL query, we can tolerate a much higher rate of token rejection (since we will have many fewer values that require tokens) but we will need to be able to embed an extractable watermark in much smaller datasets. Therefore the proposed configuration for this scenario is:
This gives a token rejection rate of about 17.8%.
The largest component of the computational cost of the algorithm is in calculating the cryptographic hash. Therefore, it may seem that as h increases, the computational cost of the algorithm will increase significantly. However, thanks to the Kirsch-Mitzenmacher Optimisation it is only ever necessary to calculate a single 64-bit hash, and then multiple 32-bit hashes can be cheaply derived from this through a multiply-and-mod operation without any loss of randomness. Increasing the number of hashes does increase the amount of computation required, but in a reasonably modest way (from benchmarking, the cost of computing fifty hashes when embedding a watermark is ~1.7× the cost of computing one hash, not 50×).
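A minimal sketch of that derivation—one 64-bit keyed hash split into two 32-bit halves, combined with a multiply-and-mod step in the standard Kirsch-Mitzenmacher fashion—is shown below; the key and function names are illustrative.

```python
import hashlib
import hmac

KEY = b"illustrative watermarking key"

def derived_bins(token: str, num_hashes: int, num_bins: int) -> list:
    """Compute one 64-bit keyed hash, then derive num_hashes bin indices from
    its two 32-bit halves via multiply-and-mod (Kirsch-Mitzenmacher)."""
    digest = hmac.new(KEY, token.encode(), hashlib.sha256).digest()
    h64 = int.from_bytes(digest[:8], "big")
    h1, h2 = h64 >> 32, h64 & 0xFFFFFFFF       # two 32-bit halves of the single hash
    return [(h1 + i * h2) % num_bins for i in range(num_hashes)]
```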
This optimisation may not apply as we increase m, since each Hash Array instance has its own base key. Therefore, computation time does scale linearly with m. However, this is only of concern when we need to calculate hashes across different Hash Array instances when extracting a watermark. When embedding a watermark (which is the only operation on the tokenisation critical path), we only ever need to test a single Hash Array instance (the one containing the data release we are embedding the watermark for) and so embedding computation time is independent of m.
Extracting a watermark requires the bins to be held in memory whilst the data is traversed, incrementing the bin counts (we also use a single Bloom filter—regardless of the values of h, b, and m—that requires a small amount of memory). Increasing h or m will result in more copies of the bin array being in memory and increasing b will result in the bin array being larger in each copy. The values of b and h are fixed for our scenarios, and far too small to use much memory per Hash Array instance. However, the memory needed to extract a watermark will grow as the number of watermarks that have been embedded grows (i.e. as m grows). Should m reach a large enough number that memory usage becomes a problem, it may be necessary to do multiple passes through the data, partitioning the values of m across them.
Embedding a watermark requires no state to be stored in memory and so is unaffected.
This section presents the results of experiments run against the proposed scheme to validate various aspects of its behaviour. In summary, these results show us that:
Note: all the results presented in this section were obtained using a similar bulk dataset configuration (b=1024 and h=10). However, the observations also apply to datasets having any other configuration, just with the absolute numbers scaled in the way predicted in the previous discussion.
Here we clearly see the effect of the confidence level parameter, which is to bound the rate at which false positives are returned (once we hit the token count at which our modelling tells us that we will need to extract the watermark at 95% confidence—1017 tokens—we always get back the correct data release, but it is accompanied by a false positive result at a rate that never exceeds the alpha level implied by our confidence).
The previous graph demonstrates the use of a 95% confidence level so that the effect of this parameter can be easily visualised, but in a watermark extraction a false positive is an undesirable event—in some situations it may result in accusing an innocent recipient of being the source of a data leak—and so a false positive rate of 5% would be far too high for a real usage. But the confidence level gives us an easily understandable mechanism for reducing the frequency of false positives down to a desired level—if a user specifies a confidence level of 99.9% then they can be sure that a false positive will be returned in no more than 0.1% of cases. And since the score for a true positive data release increases so rapidly with the number of tokens, this additional accuracy comes at only a modest cost to the number of tokens required to extract the actual watermark. This is shown in
The experiment above was repeated, but with an increasing level of random noise progressively added. The graph in
Here we can clearly see that the false positive rate is bounded by the supplied confidence level (and that it does not depend on the level of noise).
The earlier discussion of the watermark extraction algorithm presents a method for estimating the number of tokens that are needed to extract the watermark for a given confidence level and amount of noise added to the input dataset. To test the accuracy of this, the number of tokens required to extract the watermark at 95% confidence were calculated for various noise levels and then an experiment was run for each of these where we attempt to extract the watermark over a dataset of this size 10,000 times and record the outcomes. The results of this experiment are shown in
From this graph we can see that the estimate for the number of tokens required is accurate—we see that the rate of successful extractions tracks the expected confidence level well (and, as usual, the false positive rate never exceeds the expected 5%).
Here we can see that the multiple watermarks are correctly disentangled, even with the addition of random noise. As expected, more tokens are required to extract more watermarks (since from the point of view of one of the watermark bins, the data carrying the other watermark just appears as random noise and so slows down the extraction of that watermark in the way shown in the previous sections).
The false positive rate is not shown in the graphs above but is bounded at 5% as expected.
Our proposed scheme uses the same bin for a data release in each Hash Array. But an alternative scheme would use a different bin in each hash function for the data release, representing the data release as a set of hash function+bin pairs (one for each hash function)—a bin in any given hash function will be used for multiple data releases, but the combination of bins across hash functions will be unique to that data release. As shown in
With this configuration the number of watermarks that are supported is given by:
But this exponential growth in data releases was ultimately the reason that this configuration was rejected. As we have seen, there is a fundamental relationship between the number of tokens required to extract the watermark and the number of data releases that can be supported for a given token rejection rate—therefore, somewhat paradoxically, it is actually advantageous to have a scheme where the growth in the number of data releases is slower so that we can more exactly fit the token rejection rate as close as possible to the allowed 1% bound, thus getting as close as possible to the minimum number of required tokens.
This appendix summarises the key features A-D. Each feature listed can be combined with any other feature A-D. Each optional feature defined below can be combined with any feature and any other optional feature.
Computer implemented process for embedding a digital watermark within tokenised data, comprising the steps of:
Computer implemented process for embedding a digital watermark within tokenised data, comprising the steps of:
Computer implemented process for embedding a digital watermark within tokenised data, comprising the steps of:
Computer implemented process for embedding a digital watermark within tokenised data, comprising the steps of:
A computing device or system adapted to embed a digital watermark within tokenised data, the device or system comprising a processor that is configured to:
It is to be understood that the above-referenced arrangements are only illustrative of the application for the principles of the present invention. Numerous modifications and alternative arrangements can be devised without departing from the spirit and scope of the present invention. While the present invention has been shown in the drawings and fully described above with particularity and detail in connection with what is presently deemed to be the most practical and preferred example(s) of the invention, it will be apparent to those of ordinary skill in the art that numerous modifications can be made without departing from the principles and concepts of the invention as set forth herein.
Priority application: GB 2113485.3, filed September 2021 (national).
International filing: PCT/GB2022/052401, filed 22 September 2022 (WO).