This invention relates to the determination of similarity and dissimilarity between events that can be represented in digital form.
Many common practical situations are not “binary” (identical/different, yes/no, 1/0) but rather display varying degrees of similarity. For example, different entities, such as different companies, or different departments within a company or agency, might have respective copies of a data set that are in essence identical, but that differ by only a few words or metadata items, such that a strict binary comparison would indicate simply that they are not the same. As another example, two scans of the same finger or iris will in practice never indicate total digital bit-level identity. Similar issues arise in many other contexts as well, for example, data sets from different sources, personal identification information, texts up to a fixed length, computer/financial security/risk events, etc.
One disadvantage of many existing comparison schemes is that they reduce resolution (and hence implicitly security) to be able to provide a useful comparison at all. For example, assume that the coordinates of two versions of a physical signature are measured with 16-bit precision. The probability that the data sets representing the signatures will match bit-for-bit is infinitesimal. Most systems therefore consider corresponding point pairs to “match” if their difference lies in some range, which means that less than 16 bits of precision are actually used; it also means that it is more likely that a fake signature will be accepted.
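To make the precision loss concrete, a window-based comparison of 16-bit coordinates might look like the following sketch (hypothetical; the tolerance value is arbitrary, not taken from the source):

```python
def tolerant_match(a: int, b: int, tolerance: int = 256) -> bool:
    """Treat two 16-bit measurements as 'matching' if they differ by at
    most `tolerance`. A window of 256 values means only the top 8 of the
    16 measured bits effectively distinguish inputs."""
    return abs(a - b) <= tolerance

# Two scans of the same signature point rarely agree bit-for-bit,
# so the comparison must accept any value inside the window:
print(tolerant_match(0x1A2B, 0x1A9F))  # within the window -> True
print(tolerant_match(0x1A2B, 0x5C00))  # far apart -> False
```

The wider the window, the fewer effective bits of precision remain, and the easier it becomes for a forgery to fall inside an accepted range.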
Yet another disadvantage of many existing methods is that they require or lead to leakage of information contained in the data sets being compared; this can be a serious drawback when the data sets include personal information.
In some situations, what is needed is a notion of similarity among the elements in compared data sets, such as in applications that identify similar events, for example biometric identification, entity resolution or record linkage. In some other situations, what is needed is implementation of a notion of dissimilarity of events, such as for intrusion or fraud detection.
In broadest terms, the invention provides a method and system implementation that enables comparison of representations of events submitted to a server by different entities (“clients”), without necessarily having to reveal anything about the underlying “raw” data itself.
As used here, an “event” is any data set, just a few examples of which are data at rest, such as stored files, data obtained from processing, data obtained by conversion of analog information or directly input in digital form from entities such as other computing systems, sensors, or even human users, etc. In other words, an “event” is anything—physical or logical—that can be represented in the form of a vector of elements, in which each element is a string of binary digits.
The data sets may represent the same thing, such as measurements of physical objects (for example, fingerprints, retinas or irises of eyes, images, etc.), but where the representations in digital form may in general not be identical, simply by the nature of the measurement process. The invention also comprises a method of coordination and communication between physical computing systems to carry out similarity or dissimilarity determination with at most negligible information leakage.
According to one feature of the invention, client systems Ci (see
After encoding the data the respective client encrypts it to protect its content from disclosure but in such a way that similarity comparison over the encrypted data is enabled. To accomplish this, embodiments employ an approximate equality operator, which can then be used to build applications, such as for anomaly detection or record linkage. For anomaly detection, the invention—in particular, a server or “service provider” 100 (see
Embodiments may operate with two different types of entities, each of which may comprise more than one party/system. One entity type (a “client” 200) supplies encrypted data elements and the other entity type (the service provider 100) compares encrypted elements. The entity or entities performing the comparison will learn the result of each comparison, i.e., whether two elements are similar or not, but nothing substantial beyond that. They may share this information with the parties supplying the encrypted elements, or share only the result of additional analyses performed after those comparisons.
Embodiments use a novel, secure, approximate equality operator, which takes, as inputs, integer vectors \vec{x} and \vec{y}. The approximate equality operator first evaluates a similarity function (an inverse distance) s(\vec{x}, \vec{y}) between \vec{x} and \vec{y} and then compares s(\vec{x}, \vec{y}) to a threshold t. If s(\vec{x}, \vec{y}) ≥ t, then the operator outputs (approximately) “equal”; otherwise, it outputs “not equal”. A preferred relationship enabling determination of the threshold t is described below.
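Stripping away the security machinery for a moment, the underlying operator is simple. A minimal sketch, assuming for illustration a similarity function that counts element-wise agreements:

```python
def similarity(x: list[int], y: list[int]) -> int:
    """An 'inverse distance': the number of positions where the vectors agree."""
    assert len(x) == len(y)
    return sum(1 for xi, yi in zip(x, y) if xi == yi)

def approx_equal(x: list[int], y: list[int], t: int) -> bool:
    """Output 'equal' iff s(x, y) >= t."""
    return similarity(x, y) >= t

print(approx_equal([1, 2, 3, 4], [1, 2, 9, 4], t=3))  # 3 of 4 agree -> True
print(approx_equal([1, 2, 3, 4], [7, 8, 9, 4], t=3))  # 1 of 4 agree -> False
```

The secure version of the operator computes the same decision, but over encoded vectors rather than the plaintext inputs.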
The secure approximate equality operator reduces the information revealed about the vectors {right arrow over (x)} and {right arrow over (y)}. For an ideally secure operator, the party executing the operator should learn only the result, i.e., “equal” or “not equal”, but nothing else. This implies that the party executing the operator does not learn the plain vectors {right arrow over (x)} and {right arrow over (y)}, but rather only some transformed form of the vectors that the party cannot reverse. In one embodiment, the secure approximate equality operator works in a relaxed security model, where no information about {right arrow over (x)} and {right arrow over (y)} is revealed if the output is “not equal” but some information is revealed if the output is “equal”.
Two types of information may possibly be leaked: 1) more precise information about where two vectors are similar/dissimilar, for example, element indices where they are the same/differ, and 2) information about the encryption, which may, depending on the chosen encryption scheme, be used to recover information such as 1) about other vectors. This leakage can be contained, however, by re-keying the equality operator, that is, the parties choose a new key to use for subsequent vector pairs. This re-keying causes information such as 1) to be lost and renders information such as 2) useless. This embodiment has the advantage that it is much faster than known ideally secure operators.
Formalization
Let λ be a security parameter that bounds a computational adversary, that is, λ is chosen such that an adversary with limited computational resources is not able to defeat the scheme. The secure approximate equality operator consists of three, possibly probabilistic, polynomial-time operations: a key-generation operation KeyGen(1^λ), which outputs a key K; an encoding operation Encode(K, \vec{x}, t), which outputs an encoded vector \vec{c}; and a comparison operation Compare(\vec{c}_1, \vec{c}_2), which outputs either T (“equal”) or ⊥ (“not equal”).
A secure approximate equality operator is correct if, for all λ, \vec{x}, \vec{y}, and t, with

K ← KeyGen(1^λ)
\vec{c}_1 ← Encode(K, \vec{x}, t)
\vec{c}_2 ← Encode(K, \vec{y}, t),

the following both hold:

Pr[s(\vec{x}, \vec{y}) < t ∧ Compare(\vec{c}_1, \vec{c}_2) = T] = negl(λ)
Pr[s(\vec{x}, \vec{y}) ≥ t ∧ Compare(\vec{c}_1, \vec{c}_2) = ⊥] = negl(λ)
In words, if the original vectors {right arrow over (x)}, {right arrow over (y)} are not sufficiently similar, then the likelihood that the comparison function Compare indicates that their encoded transformations are equal should be nil (the ideal case), or at most negligible (negl); conversely, if {right arrow over (x)} and {right arrow over (y)} are sufficiently similar, the probability that Compare indicates they are not should also be nil or at most negligible. In short, there should be at worst a negligible probability of a “false positive” or “false negative”. The acceptable probability of failure is determined by the parameter λ, which the system designer may choose according to the degree of probability of failure deemed acceptable in any given implementation of the invention.
Without loss of generality, one could formulate these statements with respect to distance instead of inverse distance, that is, set the threshold relative to dissimilarity rather than similarity. The term “similar” as used herein should therefore be understood to encompass “dissimilarity” as well, inasmuch as the procedures to determine either are essentially the same; dissimilarity can be considered a form of “negative similarity”.
Let ℒ(\vec{x}, \vec{y}) be the information about \vec{x} and \vec{y} leaked by executing the secure approximate equality operator, and denote computational indistinguishability of two ensembles E1 and E2 as E1 ≈ E2. We say an approximate equality operator is ℒ-secure if there exists a simulation function (a “simulator”) Sim(ℒ(\vec{x}, \vec{y})) such that, for all \vec{x}, \vec{y}, and t, with

K ← KeyGen(1^λ)
\vec{c}_1 ← Encode(K, \vec{x}, t)
\vec{c}_2 ← Encode(K, \vec{y}, t),

Sim(ℒ(\vec{x}, \vec{y})) ≈ Compare(\vec{c}_1, \vec{c}_2)
In words, the simulator Sim, which is a function of the leakage ℒ(\vec{x}, \vec{y}), which in turn is a function of the “raw” information in \vec{x}, \vec{y}, should produce the same output as Compare, which is a function of the encoded information in \vec{x}, \vec{y}, that is, of \vec{c}_1, \vec{c}_2. This implies that Compare cannot leak more information than the simulator Sim is given.
Let x_i be the i-th entry of vector \vec{x} and y_i be the i-th entry of vector \vec{y}, and let H_K denote a keyed transformation function. The relaxed leakage function ℒ* thus returns at least four pieces of information in the case (and only in the case) that two event vectors exceed the similarity threshold of 2t−n, namely, the “equality” indication T, as well as the indices of non-matching elements and the values of the transformation functions of the non-matching vector element pairs. The indicators T/⊥ are the outputs as such, whereas the other values are inferable from the computation.
The leakage property reflects a tradeoff that leads to much greater efficiency than functional (inner-product predicate) encryption. What is leaked (not the output as such) is the part of the encryption key where the elements differ. The H(⋅) values are derived from the encryption key and, if they are available for all i and x_i, they are sufficient to encrypt or decrypt an input. The keys are leaked if the input differs in some, but few, indices i (how few is determined by the chosen threshold), so that the inputs are still recognized as similar but not identical. If, however, there is insufficient information to determine similarity, then no information about the keys is leaked.
This “relaxed” leakage function ℒ* has properties that differ from the “binary” leakage function ℒ described above. First, ℒ* provides information in addition to the solely binary T/⊥ indication, but only in case of a match; otherwise, no additional information is revealed. More particularly, ℒ* includes the two types of leakage information mentioned above, in that i corresponds to leaked information type 1) and the rest of the expression corresponds to leaked information type 2). Second, in one specific instance of encoding/leakage, an attacker would in principle be able to recover the leakage whenever s > 2t−n, but methods that run in polynomial time, such as the known Berlekamp-Welch algorithm, would be able to recover the match only if s > t, since they would need additional information; the worst-case attack may require time exponential in n. Other polynomial-time algorithms, such as known list-decoding methods (which, instead of outputting a single possible message, output a list of possibilities, “successful decoding” meaning that one of them is correct), may, however, under some circumstances be able to recover a match for the case s > t′ where t′ < t.
As is well known in the field of cryptography, a message authentication code (MAC) is a cryptographic function of input data that uses a session key to detect modifications of the data. The simplest version of a MAC takes two inputs, a message and a secret key, and generates an output that is sometimes referred to as a “tag”. The MAC is non-invertible, that is, one-way, meaning that one cannot determine the input message from the tag, even if one knows the secret key used in generating the tag. On the other hand, a standard MAC is deterministic: given the same inputs, it will generate the same output. It is also collision resistant, meaning that it is computationally very difficult to find two distinct inputs that lead to the same MAC output. A verifier who also possesses the key can thus use it to detect changes to the content of the message in question. In most common uses of a MAC, the MAC routine itself generates the secret key for a message. In embodiments of the invention here, however, the clients 200 select the key to be used when transmitting their messages for comparison.
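As a concrete (hypothetical) instantiation, HMAC-SHA256 from the standard library exhibits the MAC properties just described; the nested construction MAC_{MAC_K(i)}(x_i) used later in this description first derives a per-index key and then MACs the element under it:

```python
import hashlib
import hmac
import secrets

def mac(key: bytes, message: bytes) -> bytes:
    """A standard MAC (here HMAC-SHA256): one-way, deterministic for
    identical inputs, and collision resistant."""
    return hmac.new(key, message, hashlib.sha256).digest()

K = secrets.token_bytes(32)            # client-selected secret key

tag1 = mac(K, b"event data")
tag2 = mac(K, b"event data")
tag3 = mac(K, b"event data!")
print(tag1 == tag2)                    # deterministic -> True
print(tag1 == tag3)                    # modified message -> False

# Nested use: derive a per-index key MAC_K(i), then MAC the element x_i
per_index_key = mac(K, (7).to_bytes(8, "big"))
element_tag = mac(per_index_key, b"x_7")
```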
Let MACK(⋅) be a message authentication code using key K and denote the key generation function for the message authentication code MACK(⋅) as KeyGenMAC.
To implement the encoding, embodiments may, for example, use linear error-correcting codes, e.g., Hamming or Reed-Solomon codes, as described in more detail below. A codeword in a Reed-Solomon code may be used as a secret share in the known Shamir secret-sharing scheme. For simplicity, an embodiment is described below that uses Shamir's secret shares, but any codewords from a linear error-correcting code could be used instead.
As a summary, Shamir's scheme builds on the observation that a polynomial of degree d is uniquely determined by the values of any set of d+1 distinct points that satisfy the polynomial expression. For example, knowledge of three distinct points that lie on a parabola is sufficient to completely determine the parabola. If more than d+1 distinct points are provided, then any subset of d+1 of them may be used to determine the polynomial even if the other points remain secret. At the same time, knowledge of fewer than d+1 values leaves the polynomial undetermined and provides no more information than knowledge of no points at all, or of random numbers.
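A minimal sketch of share generation and Lagrange reconstruction over a prime field (the field size and the identifiers z = 1..n are illustrative choices, not mandated by the source):

```python
import random

P = 2**61 - 1  # a Mersenne prime; any prime larger than the secret works

def make_shares(secret: int, tau: int, n: int) -> list[tuple[int, int]]:
    """Shamir shares: evaluate a random degree-tau polynomial whose
    constant term is `secret` at identifiers z = 1..n."""
    coeffs = [secret] + [random.randrange(P) for _ in range(tau)]
    return [(z, sum(c * pow(z, i, P) for i, c in enumerate(coeffs)) % P)
            for z in range(1, n + 1)]

def reconstruct(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at z = 0 recovers the secret from any
    tau+1 distinct shares."""
    secret = 0
    for j, (zj, sj) in enumerate(shares):
        num, den = 1, 1
        for k, (zk, _) in enumerate(shares):
            if k != j:
                num = num * (-zk) % P
                den = den * (zj - zk) % P
        secret = (secret + sj * num * pow(den, P - 2, P)) % P  # den^-1 via Fermat
    return secret

shares = make_shares(secret=42, tau=2, n=5)
print(reconstruct(shares[:3]))  # any 3 of the 5 shares suffice -> 42
```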
More formally, let SS_{τ,m}(z) be the polynomial Σ_{i=1}^{τ} α_i z^i + m over a group 𝔾, as used in Shamir's secret sharing to generate secret shares for identifier z and secret m. Given τ+1 secret shares from the same polynomial SS_{τ,m}(z), one can reconstruct the secret m, for example efficiently using Lagrange interpolation, but τ shares are indistinguishable from τ uniformly random numbers in 𝔾. Given n secret shares, of which t = τ + ⌈(n−τ)/2⌉ are from the same polynomial SS_{τ,m}(z) and ⌊(n−τ)/2⌋ are not, one can reconstruct m in polynomial time using Berlekamp-Welch's algorithm. (As is conventional, ⌈⋅⌉ and ⌊⋅⌋ represent the ceiling and floor functions, respectively.) It follows that n ≥ t > n/2 must hold. Other decoding algorithms for Reed-Solomon codes, such as list decoding, mentioned above, may also be used. For linear error-correcting codes other than Reed-Solomon codes, one must use the corresponding error-correcting recovery algorithm.
Now recall the condition in the definition of ℒ*, that is, s(\vec{x}, \vec{y}) > 2t−n, as well as the expression t = τ + ⌈(n−τ)/2⌉ = ⌈½(τ+n)⌉ relating to the Shamir secret shares shown above. In this embodiment, the threshold t may thus be determined as a function of τ+n.
Shamir's secret sharing is linear (as the error-correcting code is linear): given secret shares σ = SS_{τ,m}(z) and σ′ = SS_{τ,m′}(z), then σ + σ′ = SS_{τ,m+m′}(z). Given τ+1 secret shares from SS_{τ,m+m′}(z), one can reconstruct the sum of the corresponding secrets, m+m′.
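This linearity can be checked directly: point-wise sums of shares are shares of the sum of the secrets, on the coefficient-wise sum polynomial. A small illustration (field size and coefficients chosen arbitrarily):

```python
P = 7919  # small prime field for illustration

def shamir_poly(coeffs: list[int], z: int) -> int:
    """Evaluate sum_i coeffs[i] * z^i mod P; coeffs[0] is the secret."""
    return sum(c * pow(z, i, P) for i, c in enumerate(coeffs)) % P

# Two secrets m = 42 and m' = 17, shared with arbitrary coefficients
m_poly  = [42, 1015, 3733]   # degree tau = 2
mp_poly = [17, 2048, 991]
ids = range(1, 6)

sigma   = [shamir_poly(m_poly, z)  for z in ids]   # shares of m
sigma_p = [shamir_poly(mp_poly, z) for z in ids]   # shares of m'

# Point-wise share sums lie on the coefficient-wise sum polynomial
sum_poly = [(a + b) % P for a, b in zip(m_poly, mp_poly)]
assert all((s + sp) % P == shamir_poly(sum_poly, z)
           for s, sp, z in zip(sigma, sigma_p, ids))
print(sum_poly[0])  # the combined secret m + m' -> 59
```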
Now let 𝟙(α, β) be the equality function, i.e., 𝟙(α, β) = 1 if α = β and 0 otherwise. The secure approximate equality operator may then implement the following similarity function s:

s(\vec{x}, \vec{y}) = Σ_{i=1}^{n} 𝟙(x_i, y_i)
One way to define the secure approximate equality operator is as follows:
Correctness: If x_i = y_i, then MAC_{MAC_K(i)}(x_i) = MAC_{MAC_K(i)}(y_i) and ρ_i is a correct secret share on a polynomial SS_{2t−n, m+m′}(z). Hence, if at least t elements ρ_i are correct secret shares, then reconstruction of m+m′ will be successful. If, however, x_i ≠ y_i, then ρ_i is uniformly distributed in 𝔾. If more than n−t elements ρ_i are uniformly random, then reconstruction will be unsuccessful.
Relaxed Security: The leakage ℒ*(\vec{x}, \vec{y}) is sufficient to simulate the secure approximate equality operator; here, the term “simulate” is used in the sense that there exists a simulation function Sim(ℒ*(\vec{x}, \vec{y})) that produces an output that an adversary cannot distinguish from a real execution, i.e., the defender simulates a real execution without using the secret information. If at most 2t−n elements in the vector are equal, then at most 2t−n secret shares are on the polynomial SS_{2t−n, m+m′}(z), and these are indistinguishable from uniformly random numbers. Since the other elements are not equal, their secret shares are also indistinguishable from random numbers. Hence, the entire vector is indistinguishable from random numbers. If more than 2t−n elements in the vector are equal, then the reconstructor learns ρ_i = σ_i + σ′_i + MAC_{MAC_K(i)}(x_i) − MAC_{MAC_K(i)}(y_i), and the recovery algorithm (for example, Berlekamp-Welch's) also outputs σ_i + σ′_i. The leakage ℒ*(\vec{x}, \vec{y}) includes MAC_{MAC_K(i)}(x_i) − MAC_{MAC_K(i)}(y_i). The secret share σ_i + σ′_i is identically distributed to a random secret share SS_{2t−n, m}(i).
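Putting the pieces together, an operator of this kind can be prototyped end to end. The following is an illustrative sketch only, not the patented construction itself: HMAC-SHA256 stands in for the MAC, a 61-bit Mersenne-prime field for the group 𝔾, and a brute-force subset search stands in for the Berlekamp-Welch decoder (adequate for small n); the sign convention splitting the two clients' encodings is likewise an assumption, chosen so that ρ_i = σ_i + σ′_i + MAC_{MAC_K(i)}(x_i) − MAC_{MAC_K(i)}(y_i):

```python
import hashlib
import hmac
import secrets
from itertools import combinations

P = 2**61 - 1  # prime modulus for the secret-share field

def keygen() -> bytes:
    """KeyGen: the clients agree on a shared MAC key K."""
    return secrets.token_bytes(32)

def _mac_int(key: bytes, msg: bytes) -> int:
    """A MAC tag interpreted as a field element."""
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), "big") % P

def _index_key(K: bytes, i: int) -> bytes:
    """Per-index key MAC_K(i), used to key the element-level MAC."""
    return hmac.new(K, i.to_bytes(8, "big"), hashlib.sha256).digest()

def encode(K: bytes, x: list[int], t: int, sign: int) -> list[int]:
    """Encode: mask Shamir shares of a fresh random secret with
    element-wise MAC values. One client uses sign=+1, the other
    sign=-1, so the masks cancel exactly where x_i == y_i."""
    n = len(x)
    tau = 2 * t - n                                  # sharing-polynomial degree
    coeffs = [secrets.randbelow(P) for _ in range(tau + 1)]  # coeffs[0] = secret
    out = []
    for i, xi in enumerate(x, start=1):
        sigma_i = sum(c * pow(i, j, P) for j, c in enumerate(coeffs)) % P
        mask = _mac_int(_index_key(K, i), xi.to_bytes(8, "big"))
        out.append((sigma_i + sign * mask) % P)
    return out

def _lagrange_eval(points: list[tuple[int, int]], x: int) -> int:
    """Evaluate, at x, the unique polynomial through `points` (mod P)."""
    total = 0
    for j, (zj, vj) in enumerate(points):
        num, den = 1, 1
        for k, (zk, _) in enumerate(points):
            if k != j:
                num = num * (x - zk) % P
                den = den * (zj - zk) % P
        total = (total + vj * num * pow(den, P - 2, P)) % P
    return total

def compare(c1: list[int], c2: list[int], t: int) -> bool:
    """Compare: rho_i = c1_i + c2_i is a valid share wherever x_i == y_i
    and uniformly random otherwise; output 'equal' iff at least t of the
    rho_i lie on one polynomial of degree 2t-n. (A real decoder would use
    Berlekamp-Welch; this demo brute-forces candidate subsets.)"""
    n = len(c1)
    tau = 2 * t - n
    rho = [(i, (a + b) % P) for i, (a, b) in enumerate(zip(c1, c2), start=1)]
    for subset in combinations(rho, t):
        base, rest = subset[: tau + 1], subset[tau + 1:]
        if all(_lagrange_eval(base, z) == v for z, v in rest):
            return True
    return False

K = keygen()
x = [5, 9, 2, 7, 7, 1, 3]
y = [5, 9, 2, 7, 7, 8, 4]   # agrees with x in 5 of 7 positions
z = [5, 9, 0, 0, 0, 8, 4]   # agrees with x in only 2 positions

c1 = encode(K, x, t=5, sign=+1)
c2 = encode(K, y, t=5, sign=-1)
c3 = encode(K, z, t=5, sign=-1)
print(compare(c1, c2, t=5))  # s = 5 >= t -> True
print(compare(c1, c3, t=5))  # s = 2 <  t -> False
```

Note that the comparing party never sees x, y, or the MAC key; it works only on the masked shares, and a fresh random secret is drawn for every encoding.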
The invention thus combines encryption of the original data sets {right arrow over (x)} and {right arrow over (y)} with a secret-sharing procedure. This enables the system to compare the data sets without revealing them, with partial revealing of information only when the data sets have been determined to be sufficiently similar.
System Implementation
The client(s) and server communicate among themselves over any conventional wired or wireless network using a preferably secure and authenticated channel, one example of which is the known Transport Layer Security (TLS) cryptographic protocol.
In a key agreement phase, the clients communicate among each other to choose a common secret key. In a simple implementation, one client may choose the key and distribute it to the other clients.
Any known method may be used to choose which client is assigned/allowed the task of proposing a key. One method could be simply the first client to submit a key to the others. Another method would be to choose the client whose identifier, for example, MAC or other address or identifier, when used as an input to an agreed-upon randomizing function, produces an output that meets some selection criterion, such as being the lowest or highest or closest to some other value. If the set of clients is fixed, they could take “turns” choosing the current key, following a predetermined order.
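The identifier-hashing selection rule can be sketched as follows (the function names and the “lowest digest wins” criterion are illustrative assumptions):

```python
import hashlib

def select_key_chooser(client_ids: list[str], round_salt: str) -> str:
    """Every client can evaluate this deterministically and arrive at the
    same chooser without extra communication: hash each identifier with
    an agreed salt and pick the lowest digest."""
    def score(cid: str) -> bytes:
        return hashlib.sha256((round_salt + cid).encode()).digest()
    return min(client_ids, key=score)

clients = ["client-a", "client-b", "client-c"]
print(select_key_chooser(clients, round_salt="epoch-17"))
```

Changing the salt each round rotates the choice unpredictably while keeping it verifiable by all parties.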
It would also be possible to allow multiple clients to choose the current key, including key agreement protocols, such as the well-known Diffie-Hellman scheme or its variants for multiple parties. Multiple keys might also arise in a first-to-propose scheme as a result of network delay. In such cases, any consensus mechanism may be implemented in order for the clients to come to agreement concerning which is to select the current key.
In still other systems, the clients may be subordinate to some administrative system, which itself chooses which client should select the current key, using any criteria the system designer chooses to implement. The server 100 could also be called upon to select which client is to choose the current key, although this might raise a concern about possible “collusion”.
After key agreement comes an analysis phase in which the clients send their encoded data to the server for similarity or dissimilarity detection. The security objective is that the server learns whether two events are similar or dissimilar but nothing else about the events encoded in the data.
In case similar events leak additional information, a fresh key can be chosen by the clients after at least one similar pair of events has been discovered by the server. This reduces the accumulated leakage of multiple similar pairs of events.
In
Let m1-m8 indicate the following messages:
In states s1, s2, and s3, various operations are to be carried out:
The state transition conditions and actions of the client in this example are as follows:
The state transition conditions and in-state actions of the server in this example are as follows:
The server 100 can collect metadata about the encoded data, for example, source client, arrival time, etc. This information can be used in decisions about events comprised of multiple similar encoded data, for example, clusters.
Application areas where detection of similar or dissimilar data is necessary include but are not limited to private record linkage (PRL) and private anomaly detection (PAD) (e.g., of cybersecurity events). As mentioned above, however, the invention may be used in many other fields as well. The action(s) the clients, or any other concerned entity, choose to take in response to the server's determination of similarity/dissimilarity will be equally varied. A determination of similarity/dissimilarity to at least the specified degree will, for example, have different interpretations in different use cases. In some cases a remedial action may be taken, such as granting or removing authorization for the clients to access or do something.
In PRL, two or more parties have databases or portions of databases about the same entities (e.g., persons visiting a hospital). The records for the same entity are similar but may differ in the exact data, for example due to data entry errors, format differences (Bob C. Parker and Bob Parker may refer to the same entity), or differences in schema. The goal of PRL is to identify similar entities while revealing nothing about dissimilar entities. The invention is well-suited to performing PRL.
In PAD, two or more parties have a stream of events, some of which are dissimilar to the clients' normal behavior but similar among clients. This might, for example, be system events resulting from a (concerted) cybersecurity attack on multiple clients. The goal would then be to identify the similar events across clients and report them as a cluster. An example cybersecurity attack could be credential stuffing attacks where attackers attempt the same username/password pair across multiple clients.
Before comparing data for similarity or dissimilarity, as mentioned above, the data is preferably encoded into fixed-length vectors. The distance metric used for similarity or dissimilarity detection may be known at encoding time; any conventional distance metric may be used, some of which are mentioned above.
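As a hypothetical illustration of such an encoding (not taken from the source), free-text records can be hashed, bigram by bigram, into a fixed number of vector slots, so that records sharing most bigrams agree in most positions under an element-wise similarity:

```python
import hashlib

def encode_record(text: str, n_slots: int = 16) -> list[int]:
    """Hash character bigrams into a fixed number of slots, keeping the
    smallest hash value per slot; records sharing most bigrams then
    agree in most slots."""
    slots = [0] * n_slots
    s = text.lower()
    for a, b in zip(s, s[1:]):
        h = int.from_bytes(hashlib.sha256((a + b).encode()).digest()[:8], "big")
        i, val = h % n_slots, h // n_slots % 65536 + 1
        if slots[i] == 0 or val < slots[i]:
            slots[i] = val
    return slots

def agreements(x: list[int], y: list[int]) -> int:
    """Element-wise similarity: the number of positions that agree."""
    return sum(a == b for a, b in zip(x, y))

v1 = encode_record("Bob C. Parker")
v2 = encode_record("Bob Parker")
v3 = encode_record("Alice Stone")
print(agreements(v1, v2) > agreements(v1, v3))  # near-duplicates agree in more positions
```

Any deterministic encoder with this locality property could feed the secure comparison; the bigram hashing here is only one simple possibility.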
Encoding may be performed using machine learning (ML). In ML there is a training phase and an inference phase. Data from an expected distribution, that is, training data, should be known. Using this data, an ML model may be trained in any known manner, for example a neural network trained using contrastive learning.
In case of anomaly detection, a further preprocessing step may be applied. First, an encoder-decoder network may be trained using the normal data. This network is then fed normal and previously observed anomalous events. Anomalous events not used during the training will tend to have a high decoder reconstruction error, which may be used as the training data for a subsequent ML model trained using contrastive learning as before. During operation of the system, the inference phase of the ML model is used: the clients' data is fed into the model (including the encoder-decoder network in the case of anomalous events). The output is the encoding that will be protected by the key and sent to the server.
This application claims priority of U.S. Provisional Patent Application No. 63/284,294, filed 30 Nov. 2021.
Published as US 2023/0171092 A1, Jun. 2023, US.