Method for Reconciling Similar Data Sets

Description

BACKGROUND

A need exists for a method to efficiently synchronize data sets between network devices having related data stored therein. The need is particularly acute in disconnected, intermittent, and low-bandwidth environments.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a diagram illustrating an embodiment of a system configuration for use with the methods disclosed herein.

FIG. 2 shows a diagram illustrating an embodiment of a host device that may be used in the system shown in FIG. 1.

FIG. 3 shows a diagram illustrating updated data fields and corresponding documents stored in separate host devices.

FIG. 4 shows a diagram illustrating a data set partitioned into subsets such that elements within each subset are within a specified distance of each other.

FIG. 5 shows a flowchart of an embodiment of a method in accordance with the methods disclosed herein.

DETAILED DESCRIPTION OF SOME EMBODIMENTS

Reference in the specification to “one embodiment” or to “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment. The appearances of the phrases “in one embodiment”, “in some embodiments”, and “in other embodiments” in various places in the specification are not necessarily all referring to the same embodiment or the same set of embodiments.

Some embodiments may be described using the expression “coupled” and “connected” along with their derivatives. For example, some embodiments may be described using the term “coupled” to indicate that two or more elements are in direct physical or electrical contact. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. The embodiments are not limited in this context.

As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or.

Additionally, “a” or “an” is employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the invention. This detailed description should be read to include one or at least one, and the singular also includes the plural unless it is obviously meant otherwise.

The embodiments disclosed herein resolve the problem of synchronizing two sets of data where the size of the symmetric difference between the sets is small and, in addition, the elements in the symmetric difference are related through a metric such as the Hamming distance metric.

FIG. 1 shows a diagram 100 illustrating an embodiment of a system configuration for use with the methods disclosed herein. Suppose two host devices, Host A 110 and Host B 120, each have a set of length-n q-ary strings, where n and q may comprise any positive integer. As used herein, the term “host” or “host devices” refers to any device that may be connected to a communications network 130, such as a local area network (LAN) or wide area network (WAN), and has a plurality of document/files stored therein, such as shown in FIG. 3. Examples of host devices include, but are not limited to switches, routers, servers, hubs, gateways, network interface controllers, modems, computers, mobile devices, and storage devices.

FIG. 2 shows a diagram illustrating an embodiment of a host device 200 that may be used in the system shown in FIG. 1. Device 200 includes a processor 210, memory 220, and transmit/receive device 230 connected to a bus 240. Memory 220 has a plurality of documents/files 222 stored therein, along with a two-dimensional hash 224 and software 226 for encoding and decoding. It should be recognized that memory 220 may further contain other software modules stored therein for performing some or all of the steps of the methods disclosed herein.

Let S_A denote the set of strings on Host A 110 and let S_B denote the set of strings on Host B 120. The set reconciliation problem is to determine the minimum information 140 and 150 that must be sent from Host A 110 to Host B 120 with a single round of communication so that Host A 110 and Host B 120 can compute their symmetric difference S_A Δ S_B = (S_A \ S_B) ∪ (S_B \ S_A), where |S_A Δ S_B| ≦ t. Disclosed herein is a variant of the traditional set reconciliation problem whereby the elements in the symmetric difference S_A Δ S_B are related. In particular, some embodiments involve a setup where this symmetric difference can be partitioned into subsets such that elements in each of these subsets are within a certain Hamming distance of each other. The disclosed embodiments provide transmission schemes that minimize the amount of information exchanged between two hosts.

This model is motivated by the scenario where two hosts are storing a large number of documents/files, some or all of which may be large in file size. Under this setup, information is rarely or never deleted so that each database contains many different versions of the documents. Each document may have a fixed number of fields and each field may have a fixed size. When synchronizing sets of documents between two hosts, a set of hashes is produced for every document on both hosts. As an example, the hashes performed may include the CRC-32 redundancy check or the MD5 checksum. For every document, a single hash is then formed by concatenating in a systematic fashion the result of hashing each field of the document.

Suppose h_a = (0, 9, 5, 4, 3), a length-5 string vector over the 10-symbol alphabet {0, 1, . . . , 9}, is the non-binary string vector that results from performing the hash described above on document a. Suppose a single field of document a is updated, resulting in the document a′, and that h_a′ = (0, 9, 5, 4, 5) is the non-binary string vector representing the hash for a′. By the previous discussion, h_a and h_a′ differ only in the portion of h_a′ that corresponds to the field that was updated. Further, the Hamming distance between h_a and h_a′ is one.
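The per-field hashing and the resulting Hamming distance can be illustrated with a minimal sketch. It assumes CRC-32 (through Python's `zlib.crc32`) as the per-field hash; the function names `string_vector` and `hamming` and the field values are hypothetical and chosen only for illustration. With CRC-32, each entry of the string vector is a 32-bit value rather than a decimal digit.

```python
import zlib

def string_vector(fields):
    """Hash each field with CRC-32 and concatenate the results in field order."""
    return tuple(zlib.crc32(value.encode("utf-8")) for value in fields)

def hamming(x, y):
    """Number of positions in which two equal-length vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(x, y))

# Hypothetical document with five fixed fields, and a version with one field updated.
doc_a = ["ID-0042", "Norfolk", "draft", "2014", "rev3"]
doc_a_prime = ["ID-0042", "Norfolk", "draft", "2014", "rev5"]

h_a = string_vector(doc_a)
h_a_prime = string_vector(doc_a_prime)

# Only the entry for the updated field changes, so the vectors are at distance one.
print(hamming(h_a, h_a_prime))
# The string vectors from the text are also at distance one.
print(hamming((0, 9, 5, 4, 3), (0, 9, 5, 4, 5)))
```

Because each field is hashed separately, an update confined to one field can change at most one entry of the string vector.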

FIG. 3 shows a diagram 300 illustrating updated data fields and corresponding documents stored in separate host devices. A first host A may have a database A 310 having a plurality of documents stored therein, while a second host B may have a database B 330 having a plurality of documents stored therein. The documents stored in database A 310 and database B 330 are related in content, such as being different versions of the same document. Between such versions, the data within the documents may vary in only one or a few fields.

The documents stored in database A 310 and database B 330 each have a fixed structure. For example, the documents may have a fixed number of fields such as Field 1 to Field M as shown. Further, the documents may contain images in addition to text. As an example, if the fields represent images, the hash may be run on the binary data representing the image itself.

As shown, database A 310 has a document 312 stored therein, which corresponds to row 314 shown. Database A 310 further has a document 316 stored therein corresponding to row 318 and a document 320 stored therein corresponding to row 322. Database B 330 has a document 332 stored therein corresponding to row 334.

Document 316 differs in content from document 312 in that there is new data included within Field 1 of document 316 compared to document 312, as shown by the corresponding fields in rows 314 and 318. Document 320 differs in content from document 316 in that there is new content included within Field 2 of document 320 compared to document 316, as shown by the corresponding fields in rows 322 and 318. Document 320 further differs from document 312 in that document 320 contains new data in both Field 1 and Field 2 compared to document 312, as shown by the corresponding fields in rows 322 and 314. Document 332 differs in content from documents 312, 316, and 320 in that there is new data in Field M of document 332 compared to documents 312, 316, and 320, as shown by the corresponding fields in rows 334, 314, 318, and 322.

String vectors may be created for each document within the databases. String vectors may comprise any set of non-binary numbers. As an example, the string vectors comprise the output of concatenating hash functions together. For instance, if the document has the following name-value pairs ((ID, 2) (Location, 2)), and F is any hash function, then the string vector is (F(ID, 2), F(Location, 2)).

Motivated by this setup, the embodiments disclosed herein resolve the problem of reconciling sets of data elements, in particular where subsets of data elements in the symmetric difference are within a bounded Hamming distance from each other. FIG. 4 shows a diagram 400 illustrating a data set 410 which represents the symmetric difference, S_A Δ S_B, between two host devices. The circles in data set 410 represent data from a first host, such as Host A 110, and the triangles represent data from a second host, such as Host B 120. Data set 410 is partitioned into subsets 420, 430, and 440, with each subset having at most six elements. It should be recognized that subsets may contain more or fewer than six elements depending upon the particular system configuration.

The fact that the elements in the symmetric difference can be partitioned into subsets is a property of the data that is being synchronized and it is a function of the fact that the documents are related. Within the subsets 420, 430, and 440, the separation of the data elements corresponds to how different in content the data is from each other. Data 450 is not partitioned into a subset, as the respective data elements from Host A 110 and Host B 120 do not differ, as shown for example by data elements 452, which are shown as an overlapping circle and triangle.

Using subset 420 as an example, any pair of data elements, such as data elements 422 and 424, are located within a specified Hamming distance d of each other. In some embodiments, the distance d may be specified in advance by a manufacturer or user/operator of the system. As an example, d may be a number less than four, such as three, but may be any number. If the Hamming distance between elements in the symmetric difference is lower, better compression is achieved.

For two strings x, y ∈ GF(q)ⁿ, let d_H(x,y) denote their Hamming distance. We denote the Hamming weight of x as wt(x). We assume q is a constant. Let S_A ⊂ GF(q)ⁿ and S_B ⊂ GF(q)ⁿ. We say that (S_A, S_B) are (t,h,l)-sets if S_A Δ S_B can be written S_A Δ S_B = {x_1,1, . . . , x_1,k1} ∪ {x_2,1, . . . , x_2,k2} ∪ . . . ∪ {x_j,1, . . . , x_j,kj}, where j ≦ t, for 1 ≦ i ≦ j, k_i ≦ h, and for any u,w ∈ {x_i,1, . . . , x_i,ki}, we have d_H(u,w) ≦ l. As an example, suppose S_A, S_B ⊂ GF(2)⁵ where S_A = {(0,0,0,0,0),(1,0,1,1,1)} and S_B = {(0,0,0,0,0),(1,1,0,0,1)}. Then we say that (S_A, S_B) are (1,2,3)-sets since S_A Δ S_B = {(1,0,1,1,1),(1,1,0,0,1)} can be decomposed into 1 set of size 2 whereby the Hamming distance between any two elements is at most 3.
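The (1,2,3)-set example can be checked directly. The following is a minimal sketch; the helper names `hamming` and `max_pairwise_distance` are illustrative and not part of the disclosure.

```python
from itertools import combinations

def hamming(x, y):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def max_pairwise_distance(vectors):
    """Largest Hamming distance between any two vectors (0 for fewer than two)."""
    return max((hamming(u, w) for u, w in combinations(vectors, 2)), default=0)

S_A = {(0, 0, 0, 0, 0), (1, 0, 1, 1, 1)}
S_B = {(0, 0, 0, 0, 0), (1, 1, 0, 0, 1)}

# Python's ^ operator on sets is the symmetric difference.
diff = S_A ^ S_B
print(sorted(diff))                 # [(1, 0, 1, 1, 1), (1, 1, 0, 0, 1)]
print(len(diff))                    # 2, so h = 2 suffices
print(max_pairwise_distance(diff))  # 3, so l = 3 suffices
```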

Disclosed herein are transmission schemes for the problem of reconciling (1,h,l)-sets, where |S_A Δ S_B| ≦ h and for all u,w ∈ S_A Δ S_B, we have d_H(u,w) ≦ l. Discussed below is the encoding procedure that is performed on Host A. Also discussed is the decoding procedure, which is performed on Host B. The goal, after the decoding procedure, is to compute S_A Δ S_B where (S_A, S_B) are (1,h,l)-sets consisting of elements from GF(q)ⁿ.

The idea behind the encoding and decoding is to encode the symmetric difference S_A Δ S_B by specifying one element, say X ∈ S_A Δ S_B, and then specifying the remaining elements in S_A Δ S_B by describing their location relative to X. As a result, as will be described shortly, the information transmitted from Host A to Host B can be decomposed into two parts denoted w₁ and w₂. The information in the w₁ part describes the locations of the elements in S_A Δ S_B relative to X. The information in the w₂ part will be used to fully recover X. Once X is known and the locations of the other elements in S_A Δ S_B are known relative to X, then the symmetric difference S_A Δ S_B can be recovered.

Some useful notation is first introduced. An [n,d]_q code is a linear code over GF(q) of length n with minimum Hamming distance d. Suppose r is a positive integer where r<n. Let α be a primitive element in GF(q^r) where q is prime. Furthermore, let H be an r×n matrix with elements from GF(q). Suppose S = {x₁, x₂, . . . , x_s} ⊂ GF(q)ⁿ. For shorthand, we denote the set {H·x₁, H·x₂, . . . , H·x_s} as H·S. We define S_H,i where 1 ≦ i ≦ q^r so that S_H,i = {x ∈ S : H·x = αⁱ}, where with an abuse of notation, α^(q^r) = 0. We refer to the j-th element in S_H,i, when ordered in lexicographic order, as S_H,i,j. Finally, let I_H map each set S ⊂ GF(q)ⁿ to the length-q^r vector of counts I_H(S) = (|S_H,1|, |S_H,2|, . . . , |S_H,q^r−1|, |S_H,q^r|), where the last entry counts the elements of S whose syndrome is zero. The following example is provided for illustration.

Suppose q=2,n=3, S={(0,0,0), (1,1,0), (1,0,1), (0,0,1)}, and

$H_{1} = (\begin{matrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{matrix}) .$

Representing the elements of GF(4) as α¹=(0,1)^T, α²=(1,1)^T, α³=(1,0)^T, and α⁴=(0,0)^T, we have I_H1(S)=(1,2,0,1). In this case, S_H1,1={(1,0,1)}, S_H1,2={(0,0,1),(1,1,0)}, S_H1,4={(0,0,0)}, and S_H1,2,2=(1,1,0). To describe the encoding (and subsequent decoding) procedure, the following matrices are used:

1) H_l ∈ GF(q)^(r×n), for some positive integer r, is the parity check matrix for an [n,2l+1]_q code C_l;

2) H₁ ∈ GF(2)^(u×q^r), for some positive integer u, is the parity check matrix for a [q^r,2h+1]₂ code C₁;

3) H_F ∈ GF(q)^(n×n) and H̄_l ∈ GF(q)^((n−r)×n) are such that

$H_{F} = (\begin{matrix} H_{l} \\ {\overline{H}}_{l} \end{matrix})$

has full rank.
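As a numerical check of the indicator vector in the earlier example with q=2 and n=3, the following minimal sketch recomputes I_H(S) for the 2×3 matrix and the set S given there. The function names `syndrome`, `buckets`, and `indicator_vector` are illustrative only, and the list `alpha_powers` encodes the stated representation of α¹ through α⁴ in GF(4).

```python
def syndrome(H, x, q=2):
    """Matrix-vector product H . x over GF(q) for prime q."""
    return tuple(sum(h * v for h, v in zip(row, x)) % q for row in H)

def buckets(H, S, alpha_powers, q=2):
    """Return the sets S_{H,i} as sorted lists, where alpha_powers[i-1] represents alpha^i."""
    result = [[] for _ in alpha_powers]
    for x in S:
        result[alpha_powers.index(syndrome(H, x, q))].append(x)
    return [sorted(bucket) for bucket in result]   # lexicographic order within each bucket

def indicator_vector(H, S, alpha_powers, q=2):
    """Vector of counts (|S_{H,1}|, ..., |S_{H,q^r}|)."""
    return tuple(len(bucket) for bucket in buckets(H, S, alpha_powers, q))

H1 = [(1, 0, 1), (0, 1, 1)]
S = [(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 0, 1)]
alpha_powers = [(0, 1), (1, 1), (1, 0), (0, 0)]   # alpha^1, alpha^2, alpha^3, alpha^4 = 0

print(indicator_vector(H1, S, alpha_powers))  # (1, 2, 0, 1)
print(buckets(H1, S, alpha_powers)[1])        # [(0, 0, 1), (1, 1, 0)]
```

The second element of the second bucket is (1,1,0), which matches the value stated for S_H1,2,2.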

In addition to the matrix H̄_l, one more tool is required to encode w₂. Some additional notation is first introduced. Let b = (b₁, b₂, . . . , b_m) be a sequence of length m with elements from GF(q^(n−r)) such that, for any positive integer k where k ≦ s, Σ_{j=1}^{k} a_j·b_{i_j} ≠ 0, where 1 ≦ i₁ < i₂ < . . . < i_k ≦ m are distinct and {a₁, a₂, . . . , a_k} ⊂ {−1,1}. Then, the sequence b is referred to as a B_s sequence. Notice that a B_s sequence can be formed from the columns of a parity check matrix for an [m,d]_q code with dimension n−(n−r) provided d ≧ s+1.

Discussed below is an embodiment of the encoding procedure followed by the decoding procedure. For encoding, the following procedure may be performed on both Host A and Host B. For shorthand, the set S_A or S_B is referred to as S. The operations in step 3) take place over the field GF(q^(n−r)), where it is assumed that n−r>r and that m>q^r, and where b = (b₁, b₂, . . . , b_m) is a B_h sequence.

1) Let z = I_H_l(S) mod 2;

2) Define w₁ = H₁·z; and

3) Let $w_{2} = \sum_{i = 1}^{q^{r}} b_{i} \cdot \sum_{j = 1}^{|S_{H_{l}, i}|} {\overline{H}}_{l} \cdot S_{H_{l}, i, j}$.

The information (w₁,w₂) is then transmitted from the host device performing the encoding to the other host device.

For decoding, suppose (w₁^A, w₂^A) is the information transmitted by Host A to Host B and suppose (w₁^B, w₂^B) is the result of the encoding procedure if it is performed on Host B. Next, it is illustrated how to recover S_A Δ S_B given (w₁^A, w₂^A) and (w₁^B, w₂^B). The decoding procedure has two broad stages. In the first stage, the locations of the elements in S_A Δ S_B relative to some X ∈ S_A Δ S_B are determined. In the second stage, the element X is recovered. The decoding begins by first recovering the syndromes of the elements in the set S_A Δ S_B. More precisely, as a result of the error correction ability of the code with a parity check matrix H₁, the set S_S = {H_l·y : y ∈ S_A Δ S_B} is first recovered. Next, an element is arbitrarily chosen, say X_S ∈ S_S. Given this setup, X (described earlier) is precisely the element in S_A Δ S_B that maps to X_S under the map H_l, that is, X ∈ S_A Δ S_B with X_S = H_l·X.

To determine the locations of the other elements in S_A Δ S_B relative to X, every element in the set S_S is added to X_S. Let S_L = {Y_S + X_S : Y_S ∈ S_S}. As will be described below in more detail, from the set S_L the values of the elements in S_A Δ S_B relative to X can be determined. Next, the value of X is determined by canceling out some of the contributions of the elements in (S_A Δ S_B)\X from the vector w₂.

Suppose D₁ : GF(2)^u → GF(2)^(q^r) is the decoder for the code C₁, which by assumption has minimum Hamming distance at least 2h+1. D₁ takes as input a syndrome and outputs an error vector with Hamming weight at most h. Let D_l : GF(q)^r → GF(q)ⁿ be the decoder for C_l, which has minimum Hamming distance 2l+1. The decoder D_l takes as input a syndrome and outputs an error vector with Hamming weight at most l. In the following, α is a primitive element of GF(q^r).

1) Let ẑ = D₁(w₁^A + w₁^B).

2) Suppose ẑ has 1s in positions {k₁, k₂, . . . , k_v}. If ẑ = 0, then let F = Ø, and stop.

3) Define ê₂ = D_l(α^k1 + α^k2), ê₃ = D_l(α^k1 + α^k3), . . . , ê_v = D_l(α^k1 + α^kv).

4) Let z′ = w₂^A + w₂^B + Σ_{i=2}^{v} b_k_i·H̄_l·ê_i.

5) Define s₂ = z′/(b_k1 + b_k2 + . . . + b_kv).

6) Let x̂ = H_F⁻¹·(α^k1, s₂)^T.

7) F = {x̂, x̂ + ê₂, . . . , x̂ + ê_v}.

Discussed below is an example illustrating the encoding and decoding procedures.

1) Setup: Suppose Host A has the set S_A = {(1,1,1,0,0,0,1), (1,1,0,0,1,1,0), (1,0,0,0,0,1,1), (0,0,0,1,0,0,1)}, and Host B has the set S_B = {(1,1,1,0,0,0,1), (1,1,0,0,1,1,0), (1,0,0,0,0,1,1), (1,0,0,1,0,0,1)}. In this case, S_A Δ S_B = {(0,0,0,1,0,0,1), (1,0,0,1,0,0,1)} and d_H((0,0,0,1,0,0,1), (1,0,0,1,0,0,1)) = 1 so that (S_A, S_B) are (1,2,1)-sets. In this case x = (0,0,0,1,0,0,1) ∈ S_A \ S_B and y = (1,0,0,1,0,0,1) ∈ S_B \ S_A.

2) Encoding: We let z^A, z^B be the result of performing step 1) on Hosts A and B respectively. Similarly, let w₁^A, w₁^B, w₂^A, w₂^B be the result of performing steps 2) and 3) on Hosts A and B respectively. Suppose ζ is a primitive element of GF(8) and β is a primitive element of GF(16), where we use the primitive polynomial x³+x+1 to represent elements over GF(8) as binary vectors and we use the primitive polynomial x⁴+x+1 to represent the elements over GF(16) as binary vectors.

The following matrices may be used:

$H_{l} = (\begin{matrix} ζ & ζ^{2} & ζ^{3} & ζ^{4} & ζ^{5} & ζ^{6} & ζ^{7} \end{matrix}) = (\begin{matrix} 0 & 1 & 0 & 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \end{matrix})$

is the parity check matrix for a [7,3]₂ code. This matrix plays the same role as the matrix H_l described in the encoding procedure.

Let

$H_{1}^{'} = (\begin{matrix} 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \end{matrix}) .$

In this case, H′₁ is analogous to the matrix H₁ described in the encoding procedure. Notice that H′₁ is a parity check matrix for an [8,5]₂ code. Let

${\overline{H}}_{l} = (\begin{matrix} β^{10} & β^{7} & β^{14} & β^{12} & β^{8} & β^{13} & β^{4} \end{matrix}) = (\begin{matrix} 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{matrix})$

and note that

$(\begin{matrix} H_{l} \\ {\overline{H}}_{l} \end{matrix})$

has full rank as desired. The B₂ sequence b = (β, β², . . . , β¹⁵) is used for the example.

At step 1) of the encoding, z^A = (0,2,0,1,1,0,0,0) mod 2 = (0,0,0,1,1,0,0,0) and z^B = (0,2,0,1,0,1,0,0) mod 2 = (0,0,0,1,0,1,0,0) so that z^A + z^B = (0,0,0,0,1,1,0,0). At step 2) of the encoding, w₁^A = (1,1,1,0,0,0) and w₁^B = (0,0,0,1,1,0). At step 3) of the encoding, w₂^A = β²·β⁵ + β²·β² + β⁴·β¹⁴ + β⁵·β⁶. Similarly, w₂^B = β²·β⁵ + β²·β² + β⁴·β¹⁴ + β⁶·β⁷.
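The encoding values in this example can be reproduced with a short script. This is a sketch specialized to q=2 and to the example matrices, not a general implementation. Field elements are stored as bitmask integers (an assumption of the sketch), so addition is XOR, and all function names (`power_table`, `mul16`, `apply_columns`, `syndrome_index`, `encode`) are illustrative.

```python
def power_table(poly, degree):
    """Powers g^0, g^1, ... of g = x modulo a primitive polynomial, as bitmask integers."""
    table, value = [], 1
    for _ in range((1 << degree) - 1):
        table.append(value)
        value <<= 1
        if value >> degree:
            value ^= poly
    return table

Z = power_table(0b1011, 3)     # Z[k] = zeta^k in GF(8), polynomial x^3 + x + 1
B = power_table(0b10011, 4)    # B[k] = beta^k in GF(16), polynomial x^4 + x + 1
LOG16 = {v: k for k, v in enumerate(B)}

def mul16(a, b):
    """Multiply two GF(16) elements given as bitmask integers."""
    return 0 if a == 0 or b == 0 else B[(LOG16[a] + LOG16[b]) % 15]

H_L = [Z[k % 7] for k in range(1, 8)]              # columns zeta^1 .. zeta^7
H_BAR = [B[k] for k in (10, 7, 14, 12, 8, 13, 4)]  # columns of the second matrix
H1P = [                                            # rows of H'_1
    (1, 0, 0, 0, 1, 0, 0, 1),
    (0, 1, 0, 0, 1, 0, 0, 1),
    (0, 0, 1, 0, 1, 0, 0, 1),
    (0, 0, 0, 1, 1, 0, 0, 0),
    (0, 0, 0, 0, 0, 1, 0, 1),
    (0, 0, 0, 0, 0, 0, 1, 1),
]

def apply_columns(columns, x):
    """Matrix-vector product over GF(2): XOR the columns selected by the 1s of x."""
    acc = 0
    for column, bit in zip(columns, x):
        if bit:
            acc ^= column
    return acc

def syndrome_index(x):
    """Index i in 1..8 with H_l . x = zeta^i, using the convention zeta^8 = 0."""
    s = apply_columns(H_L, x)
    if s == 0:
        return 8
    return Z.index(s) or 7     # zeta^0 equals zeta^7

def encode(S):
    """Return (counts before mod 2, w1, w2) for a set S of length-7 binary vectors."""
    counts, w2 = [0] * 8, 0
    for x in S:
        i = syndrome_index(x)
        counts[i - 1] += 1
        w2 ^= mul16(B[i % 15], apply_columns(H_BAR, x))   # b_i = beta^i
    z = [c % 2 for c in counts]
    w1 = tuple(sum(h * v for h, v in zip(row, z)) % 2 for row in H1P)
    return tuple(counts), w1, w2

S_A = [(1,1,1,0,0,0,1), (1,1,0,0,1,1,0), (1,0,0,0,0,1,1), (0,0,0,1,0,0,1)]
S_B = [(1,1,1,0,0,0,1), (1,1,0,0,1,1,0), (1,0,0,0,0,1,1), (1,0,0,1,0,0,1)]
counts_A, w1_A, w2_A = encode(S_A)
counts_B, w1_B, w2_B = encode(S_B)
print(counts_A, w1_A)   # (0, 2, 0, 1, 1, 0, 0, 0) (1, 1, 1, 0, 0, 0)
print(counts_B, w1_B)   # (0, 2, 0, 1, 0, 1, 0, 0) (0, 0, 0, 1, 1, 0)
```

In this sketch w₁ occupies 6 bits and w₂ is a single GF(16) element, that is, 4 bits.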

3) Decoding: The decoding is performed on Host B and the information (w₁^A, w₂^A), (w₁^B, w₂^B), and S_B is known. Let D_l be the decoder for the code with a parity check matrix H_l and suppose D′₁ is the decoder for the code with a parity check matrix H′₁. At step 1) of the decoding, ẑ = D′₁(w₁^B + w₁^A) = (0,0,0,0,1,1,0,0) = z^A + z^B as desired. In this case ẑ has 1s in positions 5 and 6, which correspond to ζ⁵ and ζ⁶, so we let k₁ = 6 and k₂ = 5.

At step 3) of the decoding, ê₂ = D_l(ζ⁵ + ζ⁶) = D_l(ζ¹) = (1,0,0,0,0,0,0) = x + y. Now at step 4), we have z′ = w₂^A + w₂^B + β⁵·H̄_l·ê₂ = β⁵·β⁶ + β⁶·β⁷ + β⁵·β¹⁰ = β⁷·(β⁴ + β⁶ + β⁸) = β⁷·(β⁵ + β⁶). At step 5) of the decoding, we now have

$s_{2} = \frac{z'}{β^{5} + β^{6}} = β^{7} .$

At step 6), we find that ŷ = (1,0,0,1,0,0,1) since H_l·ŷ = ζ⁶ and H̄_l·ŷ = s₂ = β⁷ as desired. Then, at step 7), F = {(1,0,0,1,0,0,1),(0,0,0,1,0,0,1)}.
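The seven decoding steps can be traced numerically for this example. The sketch below is specialized to q=2 and replaces the decoders D′₁ and D_l, as well as the inversion of H_F, with brute-force searches, which is an assumption made for brevity and not the disclosed implementation. Field elements are bitmask integers, the w₂ inputs are evaluated from the expressions stated at step 3) of the encoding, and all function names are illustrative.

```python
from itertools import combinations, product

def power_table(poly, degree):
    """Powers g^0, g^1, ... of g = x modulo a primitive polynomial, as bitmask integers."""
    table, value = [], 1
    for _ in range((1 << degree) - 1):
        table.append(value)
        value <<= 1
        if value >> degree:
            value ^= poly
    return table

Z = power_table(0b1011, 3)     # Z[k] = zeta^k in GF(8)
B = power_table(0b10011, 4)    # B[k] = beta^k in GF(16)
LOG16 = {v: k for k, v in enumerate(B)}

def mul16(a, b):
    return 0 if a == 0 or b == 0 else B[(LOG16[a] + LOG16[b]) % 15]

def div16(a, b):
    return 0 if a == 0 else B[(LOG16[a] - LOG16[b]) % 15]

H_L = [Z[k % 7] for k in range(1, 8)]
H_BAR = [B[k] for k in (10, 7, 14, 12, 8, 13, 4)]
H1P = [(1,0,0,0,1,0,0,1), (0,1,0,0,1,0,0,1), (0,0,1,0,1,0,0,1),
       (0,0,0,1,1,0,0,0), (0,0,0,0,0,1,0,1), (0,0,0,0,0,0,1,1)]

def apply_columns(columns, x):
    """Matrix-vector product over GF(2): XOR the columns selected by the 1s of x."""
    acc = 0
    for column, bit in zip(columns, x):
        if bit:
            acc ^= column
    return acc

def zeta_pow(k):
    """zeta^k for k in 1..8, with the convention zeta^8 = 0."""
    return 0 if k == 8 else Z[k % 7]

def decode_w1(w1):
    """Brute-force D'_1: the vector of weight <= 2 whose syndrome under H'_1 is w1."""
    for weight in range(3):
        for support in combinations(range(8), weight):
            z = tuple(1 if i in support else 0 for i in range(8))
            if tuple(sum(h * v for h, v in zip(row, z)) % 2 for row in H1P) == tuple(w1):
                return z
    raise ValueError("no error pattern of weight <= 2 matches")

def decode_hl(s):
    """Brute-force D_l: the vector of weight <= 1 whose syndrome under H_l is s."""
    return tuple(1 if column == s else 0 for column in H_L)

def decode(w1_a, w2_a, w1_b, w2_b):
    z_hat = decode_w1(tuple(a ^ b for a, b in zip(w1_a, w1_b)))               # step 1
    ks = sorted((i + 1 for i, bit in enumerate(z_hat) if bit), reverse=True)  # step 2
    if not ks:
        return set()
    k1 = ks[0]
    errors = [decode_hl(zeta_pow(k1) ^ zeta_pow(k)) for k in ks[1:]]          # step 3
    z_prime, denom = w2_a ^ w2_b, B[k1 % 15]                                  # step 4
    for k, e in zip(ks[1:], errors):
        z_prime ^= mul16(B[k % 15], apply_columns(H_BAR, e))
        denom ^= B[k % 15]
    s2 = div16(z_prime, denom)                                                # step 5
    x_hat = next(x for x in product((0, 1), repeat=7)                         # step 6
                 if apply_columns(H_L, x) == zeta_pow(k1)
                 and apply_columns(H_BAR, x) == s2)
    return {x_hat} | {tuple(a ^ b for a, b in zip(x_hat, e)) for e in errors}  # step 7

w2_A = mul16(B[2], B[5]) ^ mul16(B[2], B[2]) ^ mul16(B[4], B[14]) ^ mul16(B[5], B[6])
w2_B = mul16(B[2], B[5]) ^ mul16(B[2], B[2]) ^ mul16(B[4], B[14]) ^ mul16(B[6], B[7])
F = decode((1,1,1,0,0,0), w2_A, (0,0,0,1,1,0), w2_B)
print(sorted(F))   # [(0, 0, 0, 1, 0, 0, 1), (1, 0, 0, 1, 0, 0, 1)]
```

The positions are sorted in descending order only so that k₁ = 6 and k₂ = 5 as in the text; any choice of k₁ among the positions yields the same set F.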

Under the procedure described above, 10 bits of information have been transmitted between Host A and B. Alternative methods require at least 14 bits of information exchange. Less information is transmitted because the method encodes the differences between the elements in the symmetric difference along with only one element in the symmetric difference, rather than encoding all elements in the symmetric difference. Thus, the disclosed embodiments allow for a more efficient transmittal of information between networked devices, which is especially useful in disconnected, intermittent, and low-bandwidth (DIL) network environments.

FIG. 5 shows a flowchart of an embodiment of a method 500 in accordance with the methods disclosed herein. As an example, method 500 may be performed by system 100 as shown in FIG. 1 using devices 200 as shown in FIG. 2. Also, while FIG. 5 shows one embodiment of method 500 to include steps 510-580, other embodiments of method 500 may contain fewer or more steps. Further, while in some embodiments the steps of method 500 may be performed as shown in FIG. 5, in other embodiments the steps may be performed in a different order, or certain steps may occur simultaneously with one or more other steps.

Method 500 may begin with step 510, which involves providing a first host device 110 connected to a second host device 120 via a communication network 130, the first host device 110 and the second host device 120 each having a plurality of documents stored therein (see documents 222 in FIG. 2 and the documents within Database A 310 and Database B 330 in FIG. 3). In some embodiments, each of the documents has a fixed number of fields, such as shown in rows 314, 318, 322, and 334 in FIG. 3. The content of the documents stored within the first host device 110 is related to content of the documents stored within the second host device 120. In some embodiments the content of the documents stored within the first host device 110 is related to content of the documents stored within the second host device 120 by a version number, as such documents are earlier/later versions of one another. In some embodiments, each of the documents has a fixed number of fields and each of the fields has a fixed size, such as the “ID” field in rows 314, 318, 322, and 334 in FIG. 3. Each of the first host device 110 and the second host device 120 is configured, via a processor 210 configured to run the appropriate software modules or instructions therein, to perform the steps as discussed below.

Step 520 involves creating a string vector for each document contained therein. The string vector is formed by concatenating a hash of each field of the respective document, where S_a is a set of string vectors for the plurality of documents stored on the first host device 110 and S_b is a set of string vectors for the plurality of documents stored on the second host device 120. In some embodiments, the string vectors are non-binary string vectors, as discussed in the example above.

Step 530 involves encoding, using encode/decode module 226 shown in FIG. 2, the respective set of string vectors S_a or S_b using a two-dimensional hash 224, where a first dimension, w₁, of the two-dimensional hash stores string vector differences between all elements that reside in a symmetric difference, S_a Δ S_b, and a second dimension, w₂, of the two-dimensional hash stores one string vector from S_a Δ S_b.

In some embodiments, w₁ is determined according to the equation w₁ = H₁·z, where z = I_H_l(S), wherein set S is a subset of F_qⁿ and represents a set of string vectors on either the first host device 110 or the second host device 120, where a matrix H_l ∈ F_q^(r×n) is a parity check matrix for a q-ary code of length n with minimum Hamming distance 2l+1, where if B_r(i) represents an r-bit binary expansion of an integer i, vector I_H_l(S) is formed such that an i-th entry of I_H_l(S) is |{v ∈ S : H_l·v = B_r(i)}| mod 2, where a matrix H₁ ∈ F₂^(u×q^r) is a parity check matrix for a code over F₂, which has minimum Hamming distance 2h+1.

In some embodiments, w₂ is determined according to the equation

$w_{2} = \sum_{i = 1}^{q^{r}} b_{i} \cdot \sum_{j = 1}^{|S_{H_{l}, i}|} {\overline{H}}_{l} \cdot S_{H_{l}, i, j},$

where matrix

$H_{F} = (\begin{matrix} H_{l} \\ {\overline{H}}_{l} \end{matrix}) \in F_{q}^{n \times n}$

has full rank and H̄_l ∈ F_q^((n−r)×n), where a sequence b = (b₁, b₂, . . . , b_m) with elements from F_(q^(n−r)) is such that for any vector a = (a₁, a₂, . . . , a_m) ∈ {−1,0,1}^m with at most s non-zero components, a·b^T = 0 if and only if a is equal to an all-zeros sequence, wherein set S_H_l,i = {v ∈ S : H_l·v = B_r(i)} and a j-th element of S_H_l,i, when ordered lexicographically, is denoted S_H_l,i,j.

Step 540 involves transmitting, using transmitter/receiver 230, the respective encoded set of string vectors S_a or S_b to the other of the first host device 110 and the second host device 120, such as shown by arrows 140 and 150 in FIG. 1. Step 550 involves decoding, using encode/decode module 226 shown in FIG. 2, the respective encoded set of string vectors S_a or S_b received from the other of the first host device 110 and the second host device 120. The decoding comprises using w₁ to determine the string vector differences between all elements that reside in S_a Δ S_b, recovering one string vector in S_a Δ S_b using w₂, and using the determined string vector differences between all elements that reside in S_a Δ S_b and the recovered one string vector to determine S_a Δ S_b.

Step 560 involves determining, using processor 210, at each of the respective first host device and the second host device using information from S_aΔS_b, which string vectors are missing from the respective first host device and the second host device, such as by comparing the string vectors with those stored on the host device. Step 570 involves requesting, using processor 210 and transmitter/receiver 230, missing documents pertaining to the missing string vectors from the other of the first host device 110 and the second host device 120. Step 580 involves receiving, shown by arrows 140 and 150 in FIG. 1, the requested missing documents pertaining to the missing string vectors from the other of the first host device and the second host device using information from S_aΔS_b.
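Once the symmetric difference is known, step 560 reduces to a set comparison on each host. The following is a minimal sketch, assuming each host keeps its string vectors in a Python set; the function name `missing_vectors` is hypothetical, and the sets reuse the values from the worked example above.

```python
def missing_vectors(symmetric_difference, local_vectors):
    """String vectors in the symmetric difference that the local host does not hold."""
    return set(symmetric_difference) - set(local_vectors)

S_A = {(1,1,1,0,0,0,1), (1,1,0,0,1,1,0), (1,0,0,0,0,1,1), (0,0,0,1,0,0,1)}
S_B = {(1,1,1,0,0,0,1), (1,1,0,0,1,1,0), (1,0,0,0,0,1,1), (1,0,0,1,0,0,1)}
F = {(1,0,0,1,0,0,1), (0,0,0,1,0,0,1)}    # decoded symmetric difference

# Each host would request the documents behind the vectors it lacks (step 570).
print(missing_vectors(F, S_A))   # {(1, 0, 0, 1, 0, 0, 1)}
print(missing_vectors(F, S_B))   # {(0, 0, 0, 1, 0, 0, 1)}
```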

Method 500 may be implemented as a series of modules, either functioning alone or in concert, with physical electronic and computer hardware devices. Method 500 may be computer-implemented as a program product comprising a plurality of such modules, which may be displayed for a user.

Various storage media, such as magnetic computer disks, optical disks, and electronic memories, as well as non-transitory computer-readable storage media and computer program products, can be prepared that can contain information that can direct a device, such as a micro-controller, to implement the above-described systems and/or methods. Once an appropriate device has access to the information and programs contained on the storage media, the storage media can provide the information and programs to the device, enabling the device to perform the above-described systems and/or methods.

For example, if a computer disk containing appropriate materials, such as a source file, an object file, or an executable file, were provided to a computer, the computer could receive the information, appropriately configure itself and perform the functions of the various systems and methods outlined in the diagrams and flowcharts above to implement the various functions. That is, the computer could receive various portions of information from the disk relating to different elements of the above-described systems and/or methods, implement the individual systems and/or methods, and coordinate the functions of the individual systems and/or methods.

Many modifications and variations of the disclosed embodiments are possible in light of the above description. Within the scope of the appended claims, the embodiments of the systems described herein may be practiced otherwise than as specifically described. The scope of the claims is not limited to the implementations and the embodiments disclosed herein, but extends to other implementations and embodiments as may be contemplated by those having ordinary skill in the art.

Claims

1. A method comprising the steps of:
providing a first host device connected to a second host device via a communication network, the first host device and the second host device each having a plurality of documents stored therein each having a fixed number of fields, wherein content of the documents stored within the first host device is related to content of the documents stored within the second host device, wherein the first host device and the second host device are each configured to perform the steps of:
creating a string vector for each document contained therein, wherein the string vector is formed by concatenating a hash of each field of the respective document, where S_a is a set of string vectors for the plurality of documents stored on the first host device and S_b is a set of string vectors for the plurality of documents stored on the second host device;
encoding the respective set of string vectors S_a or S_b using a two-dimensional hash, where a first dimension, w₁, of the two-dimensional hash stores string vector differences between all elements that reside in a symmetric difference, S_a Δ S_b, and a second dimension, w₂, of the two-dimensional hash stores one string vector from S_a Δ S_b;
transmitting the respective encoded set of string vectors S_a or S_b to the other of the first host device and the second host device;
decoding the respective encoded set of string vectors S_a or S_b received from the other of the first host device and the second host device, wherein the decoding comprises using w₁ to determine the string vector differences between all elements that reside in S_a Δ S_b, recovering one string vector in S_a Δ S_b using w₂, and using the determined string vector differences between all elements that reside in S_a Δ S_b and the recovered one string vector to determine S_a Δ S_b;
determining, at each of the respective first host device and the second host device using information from S_a Δ S_b, which string vectors are missing from the respective first host device and the second host device; and
requesting missing documents pertaining to the missing string vectors from the other of the first host device and the second host device.
2. The method of claim 1, wherein w₁ is determined according to the equation w₁ = H₁·z, where z = I_H_l(S), wherein set S is a subset of F_qⁿ and represents a set of string vectors on either the first host device or the second host device, where a matrix H_l ∈ F_q^(r×n) is a parity check matrix for a q-ary code of length n with minimum Hamming distance 2l+1, where if B_r(i) represents an r-bit binary expansion of an integer i, vector I_H_l(S) is formed such that an i-th entry of I_H_l(S) is |{v ∈ S : H_l·v = B_r(i)}| mod 2, where a matrix H₁ ∈ F₂^(u×q^r) is a parity check matrix for a code over F₂, which has minimum Hamming distance 2h+1.
3. The method of claim 1, wherein w2 is determined according to the equation
4. The method of claim 1, wherein the string vectors are non-binary string vectors.
5. The method of claim 1 further comprising the step of receiving the requested missing documents pertaining to the missing string vectors from the other of the first host device and the second host device using information from SaΔSb.
6. The method of claim 1, wherein the content of the documents stored within the first host device is related to content of the documents stored within the second host device by a version number.
7. The method of claim 1, wherein each of the fields has a fixed size.
8. A method comprising the steps of:
providing a first host device connected to a second host device via a communication network, the first host device and the second host device each having a plurality of documents stored therein each having a fixed number of fields, wherein content of the documents stored within the first host device is related to content of the documents stored within the second host device, wherein the first host device and the second host device are each configured to perform the steps of:
creating a string vector for each document contained therein, wherein the string vector is formed by concatenating a hash of each field of the respective document, where Sa is a set of string vectors for the plurality of documents stored on the first host device and Sb is a set of string vectors for the plurality of documents stored on the second host device;
encoding the respective set of string vectors Sa or Sb using a two-dimensional hash, where a first dimension, w1, of the two-dimensional hash stores string vector differences between all elements that reside in a symmetric difference, SaΔSb, and a second dimension, w2, of the two-dimensional hash stores one string vector from SaΔSb, wherein w1 is determined according to the equation $w_1 = H_1 \cdot z$, where $z = I_{H_l}(S)$, wherein set $S$ is a subset of $\mathbb{F}_q^n$ and represents a set of string vectors on either the first host device or the second host device, where a matrix $H_l \in \mathbb{F}_q^{r \times n}$ is a parity check matrix for a q-ary code of length $n$ with minimum Hamming distance $2l+1$, where if $B_r(i)$ represents an r-bit binary expansion of an integer $i$, vector $I_{H_l}(S)$ is formed such that an i-th entry of $I_{H_l}(S)$ is $|\{v \in S : H_l \cdot v = B_r(i)\}| \bmod 2$, where a matrix $H_1 \in \mathbb{F}_2^{u \times q^r}$ is a parity check matrix for a code over $\mathbb{F}_2$, which has minimum Hamming distance $2h+1$, wherein w2 is determined according to the equation
9. The method of claim 8, wherein the string vectors are non-binary string vectors.
10. The method of claim 8 further comprising the step of receiving the requested missing documents pertaining to the missing string vectors from the other of the first host device and the second host device using information from SaΔSb.
11. The method of claim 8, wherein the content of the documents stored within the first host device is related to content of the documents stored within the second host device by a version number.
12. The method of claim 8, wherein each of the fields has a fixed size.
13. A system comprising:
a first host device connected to a second host device via a communication network, the first host device and the second host device each having a plurality of documents stored therein each having a fixed number of fields having a fixed size, wherein content of the documents stored within the first host device is related to content of the documents stored within the second host device, wherein the first host device and the second host device each have a processor therein configured to perform the steps of:
creating a non-binary string vector for each document contained therein, wherein the string vector is formed by concatenating a hash of each field of the respective document, where Sa is a set of string vectors for the plurality of documents stored on the first host device and Sb is a set of string vectors for the plurality of documents stored on the second host device;
encoding the respective set of string vectors Sa or Sb using a two-dimensional hash, where a first dimension, w1, of the two-dimensional hash stores string vector differences between all elements that reside in a symmetric difference, SaΔSb, and a second dimension, w2, of the two-dimensional hash stores one string vector from SaΔSb, wherein SaΔSb is partitioned into subsets such that elements in each of the subsets are within a specified Hamming distance of each other;
transmitting the respective encoded set of string vectors Sa or Sb to the other of the first host device and the second host device;
decoding the respective encoded set of string vectors Sa or Sb received from the other of the first host device and the second host device, wherein the decoding comprises using w1 to determine the string vector differences between all elements that reside in SaΔSb, recovering one string vector in SaΔSb using w2, and using the determined string vector differences between all elements that reside in SaΔSb and the recovered one string vector to determine SaΔSb;
determining, at each of the respective first host device and the second host device using information from SaΔSb, which string vectors are missing from the respective first host device and the second host device; and
requesting missing documents pertaining to the missing string vectors from the other of the first host device and the second host device.
14. The system of claim 13, wherein w1 is determined according to the equation $w_1 = H_1 \cdot z$, where $z = I_{H_l}(S)$, wherein set $S$ is a subset of $\mathbb{F}_q^n$ and represents a set of string vectors on either the first host device or the second host device, where a matrix $H_l \in \mathbb{F}_q^{r \times n}$ is a parity check matrix for a q-ary code of length $n$ with minimum Hamming distance $2l+1$, where if $B_r(i)$ represents an r-bit binary expansion of an integer $i$, vector $I_{H_l}(S)$ is formed such that an i-th entry of $I_{H_l}(S)$ is $|\{v \in S : H_l \cdot v = B_r(i)\}| \bmod 2$, where a matrix $H_1 \in \mathbb{F}_2^{u \times q^r}$ is a parity check matrix for a code over $\mathbb{F}_2$, which has minimum Hamming distance $2h+1$.
15. The system of claim 13, wherein w2 is determined according to the equation
16. The system of claim 13, wherein the processor is further configured to perform the step of receiving the requested missing documents pertaining to the missing string vectors from the other of the first host device and the second host device using information from SaΔSb.
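
For illustration only, the string-vector construction and the parity indicator vector recited in the claims can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed encoding: the field hash (truncated SHA-256), the sample documents, and all function names are hypothetical. The symmetric difference is computed directly from both sets rather than recovered from the two-dimensional hash (w1, w2), and the indicator is shown only for the toy binary case q = 2, using the parity check matrix of the length-3 repetition code (minimum Hamming distance 3, so l = 1).

```python
import hashlib


def string_vector(fields, symbol_bytes=8):
    """Concatenate a hash of each field; one symbol per field.
    Truncated SHA-256 is an assumption, the claims do not name a hash."""
    return tuple(
        hashlib.sha256(field.encode("utf-8")).digest()[:symbol_bytes].hex()
        for field in fields
    )


def hamming_distance(u, v):
    """Number of positions (fields) in which two string vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Hypothetical documents with a fixed number of fields (title, body, version).
docs_a = {"doc1": ("report", "alpha", "v1"), "doc2": ("memo", "beta", "v1")}
docs_b = {"doc1": ("report", "alpha", "v2"), "doc2": ("memo", "beta", "v1")}

Sa = {string_vector(f) for f in docs_a.values()}
Sb = {string_vector(f) for f in docs_b.values()}

# Computed directly here; in the claims it is decoded from w1 and w2.
sym_diff = Sa ^ Sb
missing_on_a = sym_diff - Sa  # string vectors host A would request from B
missing_on_b = sym_diff - Sb  # string vectors host B would request from A


def syndrome_index(H, v):
    """Integer i whose r-bit binary expansion equals H.v over F_2."""
    idx = 0
    for row in H:
        bit = sum(h * x for h, x in zip(row, v)) % 2
        idx = (idx << 1) | bit
    return idx


def parity_indicator(S, H):
    """i-th entry is |{v in S : H.v = B_r(i)}| mod 2 (binary toy case)."""
    z = [0] * (2 ** len(H))
    for v in S:
        z[syndrome_index(H, v)] ^= 1
    return z


# Parity check matrix of the binary repetition code {000, 111}: r = 2, n = 3.
H_l = [[1, 0, 1], [0, 1, 1]]
Ta = {(0, 0, 0), (1, 0, 1), (1, 1, 0)}  # toy binary vectors on host A
Tb = {(0, 0, 0), (1, 0, 1), (0, 1, 1)}  # toy binary vectors on host B

# Shared elements cancel mod 2, so the XOR depends only on Ta ^ Tb.
za_xor_zb = [a ^ b for a, b in zip(parity_indicator(Ta, H_l),
                                   parity_indicator(Tb, H_l))]
```

In this sketch the two versions of the hypothetical `doc1` yield string vectors at Hamming distance 1, since only the version field changed, and `za_xor_zb` equals the indicator of the symmetric difference of the toy sets. That cancellation is why a linear encoding of the indicator can carry information about the symmetric difference alone.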

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/351,717 filed Jun. 17, 2016, entitled “Method for Efficient Synchronization of Similar Data Sets”, the content of which is fully incorporated by reference herein.

FEDERALLY-SPONSORED RESEARCH AND DEVELOPMENT

This invention is assigned to the United States Government and is available for licensing for commercial purposes. Licensing and technical inquiries may be directed to the Office of Research and Technical Applications, Space and Naval Warfare Systems Center, Pacific, Code 72120, San Diego, Calif., 92152; voice (619) 553-5118; email ssc_pac_T2@navy.mil; reference Navy Case Number 103761.

Provisional Applications (1)

	Number	Date	Country
	62351717	Jun 2016	US
