The present invention relates to electronic data transfer, and more particularly to systems and methods for determining variances in datasets and expediting the transfer of data in communication networks by implementing hash segmentation routines.
In computer networking and, in particular, in wireless computer networking, the ability to transfer data efficiently is a paramount concern. Recent developments in computing have made it possible for increasingly large files, such as executable files, text files, multimedia files, database files and the like, to be stored within a single memory device, which in turn makes it possible for increasingly small portable computing devices to store and implement these large files. In the networking environment these files are transferred from a source host to one or more target hosts via a communication medium, such as cable, broadband, wireless and the like.
It is the nature of computer software, data files, data sets and the like that it is often desirable to update or revise the software, data files or data sets in order to add features, update data, correct errors or recognize changes. Sometimes these revisions are extensive and require the entire updated data file or data set to be transferred from the source host to the target hosts. However, in many instances the changes are relatively minor, involving changes in only a small percentage of the data that makes up the file.
In instances where only minor changes to the file have occurred, it is inefficient, costly and time-consuming to transfer the entire updated file to the target host. If an entire new revised file must be delivered, the amount of data can be substantial. It is not unusual in today's computer network environment for files of 10 Megabytes or larger to exist and require frequent updating. Distribution of such large files across wired or wireless media can take an undesirably long period of time and can consume a large amount of server resources.
For example, hand held computing devices, such as personal data assistants, imager scanners and other portable computing devices, are now equipped to implement conventional operating systems, such as Windows™ (Microsoft Corporation; Redmond, Wash.), that include large datasets. It is often desirable for the administrator of a fleet of these devices, or for the device manufacturer, to update the data files or datasets residing on the devices. In certain instances this will require the administrator or manufacturer to send file or dataset updates to every device that exists in the field, and it may require the administrator or manufacturer to assess what revisions or changes have been implemented by the user of a field-deployed device.
If the entire file or dataset is to be updated, a compression algorithm is typically employed. Such programs typically compress a large executable file to between 40 percent and 60 percent of its original size and can compress some types of files even further. However, for very large computer files or collections of files, even a compressed file reduced to 40 percent of its original size still represents a substantial transmission cost in terms of transfer time and server occupancy.
For portable devices that typically rely on a battery power supply, transferring entire files or datasets consumes large amounts of power. As such, transferring entire files or datasets in the portable device realm is often restricted by power limitations or requires that the device be connected to a DC power source for the transfer operation.
In instances in which the entire updated file is not transferred, differencing programs or comparator programs have been used to compare an old file to a new revised file in order to determine how the files differ. The differencing or comparator program generates a patch file, and that patch file is then used in combination with the old file to generate the newly revised file. While these types of “patching” systems do not transfer the entire updated file, they do require that both versions (i.e., the old revision and the new revision) exist on one host for the purpose of comparing, line by line, the two versions to determine the “patch”. Additionally, most commonly available versions of these systems are limited to text file comparisons and updating.
More recently, hashing algorithms or hashing functions have been used to promote the transfer of data and to implement updated revisions to files. Hashing is the transformation of a string of characters into a usually shorter, fixed-length value or key that represents the original string. Traditionally, hashing has been used to index and retrieve items in a database because it is faster to find an item using the shorter hashed key than using the original value. The hash function is used to index the original value or key and is then used each time the data associated with the value or key is to be retrieved. A good hash function should not produce the same hash value from two different inputs; if it does, this is known as a collision. A hash function that offers an extremely low risk of collision may be considered acceptable.
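By way of a brief illustration (a minimal sketch, not drawn from any of the references discussed herein), the following hypothetical 16-bit checksum hash maps strings of any length to a short fixed-length key and demonstrates how two different inputs can collide:

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical 16-bit checksum hash: sums all bytes of the input
     * string, discarding any overflow beyond 16 bits.                */
    static uint16_t hash16(const char *s)
    {
        uint16_t h = 0;
        while (*s)
            h = (uint16_t)(h + (uint8_t)*s++);
        return h;
    }

    int main(void)
    {
        /* "ab" and "ba" contain the same bytes, so this weak hash
         * produces a collision.                                    */
        printf("hash(\"ab\") = 0x%04x\n", hash16("ab"));
        printf("hash(\"ba\") = 0x%04x\n", hash16("ba"));
        return 0;
    }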
For example, U.S. Pat. No. 6,263,348, issued to Kathrow et al., Jul. 17, 2001, teaches a method for hashing some or all of the files to be compared and comparing the hash results to identify whether differences exist between the files. Files that hash to different results may be identified as having differences, and files that hash to the same value may be identified as unlikely to have differences. The Kathrow '348 patent teaches a bi-level approach for partial hashing of files and a comparison of the hash results to isolate differences. Since the process taught in Kathrow '348 is limited to files existing on a single host and a single memory module, the patent does not concern itself with a means of isolating the differences in the files so that the amount of data actually transmitted between two hosts is minimized and maximum transfer efficiency is realized. Additionally, the patent provides no teaching as to how realignment of segment boundaries may occur after an insertion or deletion has been detected in the file. The concept of realignment provides reconciliation in those applications in which the source host is unaware of the exact data that exists on the target host.
Similar to the Kathrow teachings, U.S. Pat. No. RE 35,861, issued to Queen et al., on Jul. 28, 1998, addresses a method for comparing two versions of a text document by partial hashing of the files and comparison for the purpose of isolating differences. The method taught in the Queen '861 patent is similar to patching, in that it requires both versions of the text document to reside on one host for the purpose of comparison. In addition, the Queen teaching is limited to text files and does not teach an iterative process that isolates the differences to ensure that the minimal amount of data is transferred between hosts.
By way of another example, U.S. Pat. No. 6,151,709, issued to Pedrizetti et al., on Nov. 21, 2000, teaches a method for comparing software on a client computer against a set of updates on a server computer to determine which updates are applicable and should be transferred to the client. A hash function is applied to update identifiers to generate a table of single-bit entries indicating the presence of particular updates on the server. At the client level the same hashing function is applied to program identifiers and corresponding entries of the transferred tables are checked to determine whether the server has a potential update. Thus, the hashing routine is not implemented to determine differences within a file; rather, it is implemented to hash the file identifiers in order to determine or update the bit field index.
Therefore, the need exists to develop a method and system for determining differences in datasets and expediting the transfer of data effectively and efficiently between data files existing on separate hosts. In most instances, efficient transfer should significantly reduce the time required to transfer updates and limit the transfer resources required of the transferring host. An effective system and method should provide for the transfer of data between source host and target host in instances in which neither host is aware of the revision that exists on the other host. The method and system should effectively isolate the data that has been revised, updated, added or deleted in order to limit the data that is transferred from the source host to the target host.
The present invention provides for an improved method and system for determining variances in datasets, expediting data transfer and data reconciliation in a communication network using hash segmentation approaches. The systems and methods provide for an efficient means of communicating updated files, new revisions or verifying files between two distinct hosts. By implementing a hash segmentation approach, and in many embodiments an iterative hash segmentation approach, the updates within the files can be isolated for the purpose of minimizing the amount of data communicated from one host to the other host.
In one embodiment of the invention a method for determining the variances in datasets residing on separate hosts in a communication network is defined by the steps of creating first hash values, at a first host, corresponding to a plurality of segments of a first dataset and creating second hash values, at a second host, corresponding to a plurality of segments of a second dataset. Once the hash values have been created a comparison ensues whereby one or more first hash values is compared to the second hash values to determine which segments of the datasets differ. Typically, the hash value comparison will occur at either the first or second host after the first hash values have been communicated from the first host to the second host or after the second hash values have been communicated from the second host to the first host. In an alternate embodiment, the first and second hash values may be communicated to a third host with comparison of the first and second hash values occurring at the third host.
In a specific embodiment of the invention, if the comparison of the hash values determines that one or more segments of the first dataset differ from the second dataset then the host at which the first dataset resides will communicate the one or more segments to the second host. The second host will typically compile a third dataset, referred to herein as a target dataset, that comprises those segments determined to differ, which have been communicated from the first host, and those segments determined not to differ, which are transferred from the second dataset residing on the second host. Alternatively, the third dataset may be compiled at the first host or at a third host.
The communication of differing segments from the first host to the second host may occur automatically upon determination of a differing segment or communication may occur after a predetermined threshold of differences has been attained. Additionally, the process may entail the step of determining whether or not the differences should be communicated from one host to another. This determination may be based on the size of the difference (i.e., some differences may be so large as to be impractical to transmit on an available medium of communication) or the determination may be made based on the resources required to communicate the differing segments. In this regard, battery power consumption may prohibit communication of the differing segments of the dataset if the hosts are communicating in a wireless network. Thus, the decision may be made to postpone the communication of the differences while the hosts are in wireless communication until the hosts are in wired communication, via cable, broadband, DSL, modem or the like.
Another novel feature of the present invention is the ability of the method to isolate the differences between the first and second datasets. By isolating the difference(s) the amount of actual data communicated between the hosts can be minimized. For example, once a determination is made that a segment differs between the source dataset and the suspect dataset and the length of the segment exceeds a predetermined length, further isolation will be necessary to identify where in the segment the difference occurs. Isolation of the differences within the segments of the datasets occurs by iteratively creating subsequent hash values for sub-segments of the segments determined to differ. In one embodiment this will require creation of third hash values, at the first host, corresponding to sub-segments of a segment of the first dataset determined to have differed and creation of fourth hash values, at the second host, corresponding to sub-segments of the corresponding segment of the second dataset determined to have differed. Once the third and fourth hash values have been created, a comparison of the values occurs to determine which sub-segment(s) of the segment differ. Iterative isolation may require subsequent hash value creation and comparison to isolate the difference to an acceptable level as dictated by a static or dynamically determined difference threshold. Once differing sub-segments have been isolated to the degree necessary, communication of the sub-segments between the hosts and compilation of a third dataset may be required.
In another embodiment of the invention a method for determining differences between datasets residing on separate hosts in a communication network includes the steps of creating first dataset hash values, at a first host, corresponding to segments of a first dataset and searching, at a second host, for segments of a second dataset that have matching hash values to the first dataset hash values using a slide function of a sliding hash algorithm. This method will typically involve creating, at the second host, a first hash value for a first segment of the second dataset and comparing the first hash value of the first segment of the second dataset to the first dataset hash values to determine if the first hash value matches any of the first dataset hash values. If the comparison determines that no match exists for the hash value of the first segment of the second dataset, a slide function is performed. The slide function moves the first segment by a predetermined length to define a second segment, for which a hash value is created. Once the hash value is created for the second segment, this second hash value is compared to the first dataset hash values to determine if the second hash value matches any of the first dataset hash values.
In the sliding hash algorithm approach, the segment of the second dataset is iteratively slid, either backward or forward, by a predetermined length to define new hash values for segments of the second dataset. In this regard, the slide operation entails realigning the suspect dataset segment by a predetermined length, typically a minimum length, such as 1 byte. Once the slide function occurs, a hash value for the new realigned segment is created and this hash value is compared to all of the entries in the source dataset hash list to determine if a match exists. If a determination is made that the hash value for a segment of the second dataset and a segment of the first dataset are equal then the segment of the second dataset is copied to a third dataset. Upon completion of the iterative slide function segmentation and hash processing, if a determination is made that no match exists within the first dataset hash list, the first host may communicate all of the unmatched segments from the first dataset to the second host for inclusion in the third dataset.
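As a non-limiting sketch, the sliding search may be organized as a loop of the following general form; the helper names and the advance-by-one-byte policy are illustrative assumptions, and the hash and slide functions themselves are detailed later in this disclosure:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Placeholder declarations; the hash and slide functions are
     * detailed later in this disclosure.                          */
    uint16_t Hash16(const uint8_t *p, size_t n);
    uint16_t Slide16(uint16_t h, const uint8_t *p, size_t n);
    bool in_source_hash_list(uint16_t h);    /* e.g., a binary search */
    void copy_to_target(const uint8_t *p, size_t n);

    /* Slide a window of length seg through the suspect dataset, copying
     * any window whose hash matches a source hash list entry.  (A full
     * implementation would also validate matches and advance by a whole
     * segment after a match.)                                           */
    void sliding_search(const uint8_t *suspect, size_t len, size_t seg)
    {
        if (len < seg)
            return;
        size_t p = 0;
        uint16_t h = Hash16(suspect, seg);
        for (;;) {
            if (in_source_hash_list(h))
                copy_to_target(suspect + p, seg);
            if (p + seg >= len)      /* no byte remains to slide in */
                break;
            h = Slide16(h, suspect + p, seg);   /* realign by one byte */
            p++;
        }
    }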
In accordance with yet another embodiment of the invention, a system for expedited data transfer and data reconciliation is provided. The system comprises a first processor residing in a first host, the first processor implementing a hash algorithm to create first hash values corresponding to segments of a first dataset. In addition, the system includes a second processor residing in a second host and in network communication with the first processor, the second processor implementing the same hash algorithm to create second hash values corresponding to segments of a second dataset. The second processor compares the first hash values to the second hash values to determine which segments of the datasets differ. The hash algorithm may be a logarithmic hash algorithm, a sliding linear hash algorithm or the like. If the comparison determines that one or more of the first dataset segments differ from the second dataset then the first processor communicates the differing segments to the second processor.
The system may additionally include a compiler, residing in the second host, which compiles a third dataset that includes those segments of the first dataset determined to differ from the second dataset and those segments of the second dataset determined not to differ from the first dataset. In addition, the system may include the following features: the second processor may be capable of searching the second dataset for a match to a subset of one of the first dataset segments communicated from the first host, and both processors may work in unison to iteratively determine where, within the segments that have been determined to differ, the differences occur. Determining where the differences occur will typically involve the first and second processors isolating, iteratively, one or more differences within the one or more segments of the first and second datasets determined to have differed.
Therefore, the present invention provides for an improved method and system for expedited data transfer and data reconciliation. The method and systems of the present invention can effectively and efficiently perform data transfer and data reconciliation between data files existing on separate hosts. The resulting efficient transfer of data significantly reduces the time required to transfer updates and limits the transfer resources required to perform the transfer operation. The method and system is capable of effectively isolating the data that has been revised, updated, added or deleted in order to limit the data that is transferred from the source host to the target host. In addition, the system provides for data reconciliation in those applications in which neither host is aware of the exact data that exists on the other host.
The present invention now will be described more fully hereinafter with reference to the accompanying drawings, in which preferred embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Like numbers refer to like elements throughout.
The present invention provides for improved methods and systems for determining dataset variance and expediting data transfer in a communication network using hash segmentation processing. The systems and methods provide for an efficient means of communicating updated files, new revisions or verifying files between a source host and a target host. By implementing hash segmentation processing, and in many embodiments iterative hash segmentation processing, the updates within the files can be isolated for the purpose of minimizing the amount of data communicated from the source host to the target host. The hash segmentation process may implement a logarithmic hash segmentation approach, a sliding hash segmentation approach or another suitable hashing approach.
The source host 12 either stores or has access to a first dataset, i.e. the source dataset 16, which may be an updated or revised version of the required dataset. The target host stores or has access to a second dataset, i.e., the suspect dataset 18, which may be an outdated or un-revised dataset, a corrupt dataset, an amended dataset, a completely unrelated dataset or the like. The system of the present invention serves to detect the differences in the second dataset compared to the first dataset and, in most embodiments, compile a target dataset 20 either at the target host or in communication with the target host.
The differences between the source dataset and the suspect dataset are determined by implementing a hashing mechanism that segments the datasets and applies hash values to the segments for subsequent comparison. The source host 12 includes a first processor 22 that implements a hash algorithm 24 to create a hash list corresponding to the segments of the source dataset. The hash algorithm may be a conventional hash algorithm, such as the conventional MD5 algorithm; a sliding hash algorithm, such as the sliding algorithm herein disclosed or the hash algorithm may be any other appropriate hash algorithm. The target host 14 includes a second processor 26 that is in network communication with the first processor such that hash lists and segments of datasets can be communicated between the processors. The second processor implements the same hash algorithm 24 as the source host to create hash values corresponding to segments of the suspect dataset.
A comparator 28 is implemented at the target host to determine if hash values for corresponding segments are equal, signifying that the segments match, or unequal, signifying that the segments do not match. If further segmentation of the unmatched segment is warranted to isolate the differences, further hash segmentation is performed on the segment of the source and target datasets at the respective source and target hosts. This isolation process may continue, iteratively, until the difference is isolated to the degree necessary. This iterative processing may require the comparator 28 to be implemented on the source host as well as the target host.
A compiler 30 is implemented at the target host and serves to compile the target dataset from those segments of the suspect dataset determined to match the source dataset and those segments communicated from the source dataset that were determined not to match the corresponding segment of the suspect dataset.
The present invention is also embodied in a method for determining differences in datasets and optionally expediting data transfer and data reconciliation.
Once a determination is made that one or more segments vary, the source host may, optionally at step 130, communicate to the target host the segments determined to vary. The communication of varying segments from the source host to the target host may occur automatically upon determination of a varying segment or communication may occur after a predetermined threshold of differences has been attained. Additionally, the process may entail an optional step (not shown in
At optional step 140, the target host compiles a target dataset that includes the communicated segments from the source host and those segments of the suspect dataset that were determined not to differ from the corresponding source dataset segment.
The method may also be defined by searching within the suspect dataset for a match to a subset of the one or more segments that have been communicated from the source host. This search process provides realignment of data segment boundaries after an insertion or deletion has been detected by the hashing routine. This provides reconciliation in those applications in which the source host is unaware of the exact data that exists in the corresponding file on the target host.
The method may also be defined by isolating, iteratively, one or more variances within the one or more segments of the source and suspect datasets determined to have varied. Iterative isolation of the variances may involve further hash segmentation of the segments that have been determined to vary. For example, once a determination is made that a source dataset segment and a suspect dataset segment vary and the length of the segment exceeds a predetermined or dynamically determined threshold, further isolation will be necessary to identify where in the segment the variance occurs. This is accomplished by creating, at the source host, a third list of hash values corresponding to sub-segments of the source dataset segment determined to have differed and creating, at the target host, a fourth hash value list corresponding to sub-segments of the suspect dataset segment determined to have differed. Once the third and fourth hash segmentation lists have been created, the hash lists are compared to determine which sub-segments of the datasets differ. This process may continue iteratively until the differing segment is identified, in length, to the degree necessary.
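The iterative narrowing can be pictured as a recursive procedure of the following general shape. This is a minimal single-address-space sketch; in practice the two hash lists are built on separate hosts and exchanged over the network, and the fan-out, threshold and helper names here are illustrative assumptions:

    #include <stddef.h>
    #include <stdint.h>

    #define SUBSEGMENTS 16   /* assumed number of sub-segments per round */

    /* Placeholder helpers: hash each of 'parts' sub-segments of the given
     * range into 'out', and find the first index at which two hash lists
     * differ (returning -1 when they are identical).                     */
    void hash_sublist(const uint8_t *data, size_t off, size_t len,
                      size_t parts, uint32_t *out);
    int  first_mismatch(const uint32_t *a, const uint32_t *b, size_t parts);
    void transfer_segment(size_t off, size_t len);

    /* Recursively isolate a variance within [off, off + len) until the
     * differing region is no longer than the threshold T, then transfer
     * only that region.  (Uneven division is ignored in this sketch.)   */
    void isolate(const uint8_t *src, const uint8_t *suspect,
                 size_t off, size_t len, size_t T)
    {
        if (len <= T) {
            transfer_segment(off, len);
            return;
        }
        uint32_t src_hl[SUBSEGMENTS], sus_hl[SUBSEGMENTS];
        size_t sub = len / SUBSEGMENTS;
        hash_sublist(src, off, len, SUBSEGMENTS, src_hl);
        hash_sublist(suspect, off, len, SUBSEGMENTS, sus_hl);
        int n = first_mismatch(src_hl, sus_hl, SUBSEGMENTS);
        if (n >= 0)   /* recurse into the first differing sub-segment */
            isolate(src, suspect, off + (size_t)n * sub, sub, T);
    }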
The invention is also embodied in methods for data transfer and data reconciliation that implement hash segmentation algorithms. Two such embodiments are detailed below and implement, respectively, a logarithmic hash segmentation approach and a sliding hash segmentation approach. The general notion of a sliding hash algorithm is novel and within the bounds of the inventive concepts herein disclosed.
Logarithmic Hash Segmentation Method and System
Referring to
For the sake of complete understanding we define the source host and the target host as follows. The source host is defined as the host at which the source dataset resides or the host which has access to the source dataset. The source dataset is the dataset that requires comparison to other datasets and which may require some level of transfer to the target host. The target host is defined as the host at which the suspect dataset resides or the host which has access to the suspect dataset. The target host is also the host at which the target dataset is built. The suspect dataset may differ from the source dataset and, if a determination is made that a difference exists, data is transferred from the source dataset to a newly formed dataset, referred to herein as the target dataset.
The data transfer and data reconciliation process begins, at step 200, by creating at the source host a logarithmic hash list of the entire source dataset. By way of example, the hash list may be created by implementing a strong 128-bit hash algorithm, such as the MD5 hash algorithm. The MD5 hash algorithm is well known to those of ordinary skill in the art and produces a 128-bit hash value for each segment of data to which it is applied; thus, each entry in the hash list is 128 bits in length.
Referring to
Once the dataset is segmented, the logarithmic hash list is created by performing a hash algorithm on each of the segments, denoted by the arrows 304. As noted above, one example of such a hash algorithm is the MD5 algorithm. After the hash algorithm is performed, each resulting hash value is placed in the Nth position in the hash list, where N is the 0-based segment index. An example of the logarithmic hash list is illustrated in column 306 and the segment index is repeated in column 308. Thus, as shown, a one-to-one correlation exists between each segment of the dataset and a corresponding hash list entry. The length of the hash list entry is a function of the hash algorithm that is used.
Those skilled in the art will recognize that reconfiguring the logic or adding additional logic to detect “early out” conditions can enhance the methods described herein. “Early out” conditions refer to conditions in which an algorithm can detect that the goal or task has already been completed, typically earlier than presumed. One such example of an “early out” in the present application is a comparison of 128-bit hash values computed over the entire source dataset and the entire suspect dataset prior to commencing comparison of individual segment hash values. If the 128-bit hash values are equal, the two files or datasets are determined to be equivalent and the process can end early.
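A minimal sketch of such an early-out check follows, assuming the whole-dataset digests are computed with a conventional 128-bit algorithm such as MD5 (shown here via the OpenSSL MD5() convenience routine):

    #include <openssl/md5.h>   /* link with -lcrypto */
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Early out: if the 128-bit hashes of the complete datasets match,
     * the datasets are deemed equivalent and no segment-by-segment
     * comparison need be run.                                          */
    bool datasets_equivalent(const unsigned char *src, size_t src_len,
                             const unsigned char *sus, size_t sus_len)
    {
        unsigned char d1[MD5_DIGEST_LENGTH], d2[MD5_DIGEST_LENGTH];
        MD5(src, src_len, d1);
        MD5(sus, sus_len, d2);
        return memcmp(d1, d2, MD5_DIGEST_LENGTH) == 0;
    }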
Referring again to
Once the suspect dataset hash list has been created a comparison is performed, at step 208, that compares the source dataset hash list to the suspect dataset hash list to determine the first unequal hash list entry. Unequal hash list entries indicate that there is a difference between the source dataset and the suspect dataset within the corresponding segment.
Referring to
Referring again to
Referring to
At step 338, a query is performed to determine if there are remaining segments of the suspect dataset still requiring processing. This query is accomplished by determining if the pointer, P has passed the end of the suspect dataset. If pointer, P has not passed the end of the suspect dataset, then more segments of the suspect dataset remain to be processed, at which point, the flow returns to the composite data transfer flow of
After the source host has received the value of N and the request for data transfer, at step 344, the source host transmits the remainder of the source dataset beginning with the Nth segment. At step 346, the target host receives the remainder of the source dataset and appends the remainder of the source dataset to the end of the target dataset. Once the remainder of the source dataset is appended to the target dataset the data transfer and reconciliation process is complete.
Returning to the composite data transfer and data reconciliation flow of
Thus, if the Nth segment of the suspect dataset is greater than T in length then further isolation of the difference is warranted and, at step 214, the target host creates a hash list of the Nth segment. The creation of the Nth segment hash list will typically implement the same hash algorithm that was implemented for the creation of the overall source dataset and the overall suspect dataset. This will typically mean that the Nth segment hash list will be created by implementing a strong 128-bit hash algorithm, such as MD5 or the like. In this instance, the creation of the Nth segment hash list will be accomplished by the same flow that is illustrated in
After the hash list of the Nth segment of the suspect dataset has been created, at step 216, the target host transmits to the source host the Nth segment value and the hash list of the Nth segment of the suspect dataset. Upon receipt of the Nth segment value and the hash list of the Nth segment of the suspect dataset the source host, at step 218, creates a hash list of the Nth segment of the source dataset using the same hash algorithm that has been used previously. At step 220, a comparison of the two Nth segment hash lists is undertaken to find the first unmatched segment index, N. (It is noted that the previous unmatched segment index, N is dropped and the Nth segment is modified so that it currently refers to segment index, N.) The comparison of the two Nth segment hash lists will be accomplished by the same flow that is illustrated in
After the comparison is accomplished, an Nth segment is identified and, at step 222, a determination is made as to whether the length of the Nth segment is greater than the length T. This determination establishes whether further isolation within the Nth segment is necessary prior to communicating the segment from the source host to the target host. If the determination is made that the length of the Nth segment is greater than the length T then, at step 224, the source host creates a hash list of the Nth segment of the source dataset using the previously implemented hash algorithm. Once the hash list of the Nth segment of the source dataset has been created, at step 226, the Nth index value and the hash list of the Nth segment of the source dataset are transmitted to the target host. Once the Nth index value and hash list of the Nth segment of the source dataset are received by the target host, steps 228, 230 and 208 ensue, whereby matched Nth segments from the suspect dataset are copied to the target dataset, a hash list of the Nth segment of the suspect dataset is created and a comparison of the two Nth segment hash lists is performed to find the first unmatched segment index N. These processes are implemented in accordance with the flows illustrated in
At steps 212 and 222 if a determination is made that the Nth segment is equal to or less than the length of T then the iterative process of further isolation of the difference is ended. If the determination is made at the target host then, at step 232, the Nth index is transmitted from the target host to the source host. Upon receipt by the source host of the Nth index, the source host, at step 234, creates a hash list of the remainder of the source dataset beginning after the Nth segment using the same hash algorithm that has been used previously. The creation of the hash list of the remainder of the source dataset will be implemented using the same flow illustrated in
At step 236, the source host transmits the hash list of the remainder of the source dataset along with the source dataset Nth segment. Once the target host receives the hash list and the source dataset Nth segment, at step 238, the source dataset Nth segment is appended to the end of the target dataset. At step 240, a determination is made as to whether the transfer of data is complete. This determination is accomplished by determining if the length of the target dataset is equal to the length of the source dataset. If the lengths are equal then the transfer is complete and the process ends. If the lengths are not equal then further data reconciliation and subsequent data transfer are necessary.
Once a determination is made that the lengths are not equal and, therefore, further data reconciliation is warranted, at step 242, the pointer, P is incremented by the length of the data just appended and, at step 244, the pointer, P′ is set in the suspect dataset to the beginning of the most likely hash list alignment. Pointer, P′ is set in an attempt to realign the hash list comparison, i.e., to recover from an insertion or a deletion in the suspect dataset.
Referring to
Returning again to the composite flow of
This flow continues until the entire suspect dataset has been compared to the source dataset via the creation and comparison of hash lists. Once comparisons of the hash lists are made, determinations are made as to whether further isolation of the difference is necessary to ensure minimal data transfer between the source host and the target host. If further isolation is warranted, an iterative hashing process ensues to isolate the discrepancy. Once all segments have been compared and differences have been isolated, a target dataset will have been assembled that consists of segments of the source dataset that were determined to be different from the suspect dataset and segments of the suspect dataset that were determined not to be different from the source dataset.
Linear Hash Segmentation Method and System
Referring to
The data transfer and data reconciliation process begins, at step 400, by creating at the source host a linear hash list of the entire source dataset. By way of example, the hash list may be created by implementing a 16-bit sliding hash algorithm, which will be described in detail below. The 16-bit sliding hash algorithm is a generally weaker algorithm than the 128-bit hash algorithm but is sufficiently strong for the data reconciliation purpose and provides efficiency for the overall data transfer process.
Referring to
Once the dataset is segmented, the linear hash list is created by performing a hash algorithm on each of the segments, denoted by the arrows 504. As noted above, one example of such a linear hash algorithm is the sliding hash algorithm described below. After the hash algorithm is performed each resulting hash value is placed in the Nth position in the hash list, where N is the 0-based segment index. An example of the linear hash list is illustrated in column 506 and the segment index is repeated in column 508. Thus, as shown, a one-to-one correlation exists between each segment of the dataset and a corresponding hash list entry. The length of the hash list entry is a function of the hash algorithm that is used.
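A sketch of how such a linear hash list may be constructed over fixed-size segments follows; the segment size S, the helper name and the treatment of a trailing partial segment are illustrative assumptions, with Hash16 being the 16-bit sliding hash described below:

    #include <stddef.h>
    #include <stdint.h>

    uint16_t Hash16(const uint8_t *p, size_t n);   /* described below */

    /* Build a linear hash list: one 16-bit entry per S-sized segment,
     * where entry N covers bytes [N*S, N*S + S).  Any trailing partial
     * segment is ignored in this sketch.                              */
    size_t build_hash_list(const uint8_t *data, size_t len,
                           size_t S, uint16_t *hl)
    {
        size_t count = len / S;
        for (size_t n = 0; n < count; n++)
            hl[n] = Hash16(data + n * S, S);
        return count;
    }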
Referring again to
Referring to
If, at step 526, either N is determined to be an invalid hash list index (i.e., the end of the source hash list has been encountered) or the hash value, X, of the suspect dataset segment does not equal the corresponding hash value in the source hash list then, at step 536, a determination is made as to whether X equals any value in the source hash list. In typical embodiments of the invention it will be beneficial to sort the hash list prior to performing the step 536 determination so that a binary search of the hash list can be performed to identify matches for X. If X does not match any entry then the slide aspect of the 16-bit sliding hash algorithm is implemented. At step 538, the hash value, X undergoes a “slide” function, whereby the alignment of the suspect segment is moved one byte to the right. At step 540, P is updated to reflect the new segment alignment of the suspect dataset that is being considered for matching.
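One way to realize the sorted-list lookup contemplated at step 536 is with the standard C library qsort and bsearch routines; pairing each hash value with its original index, as sketched below, is an illustrative assumption that preserves the segment index through sorting:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Pair a hash value with its original hash list index so the index
     * survives sorting.                                                */
    typedef struct { uint16_t value; size_t index; } hash_entry;

    static int cmp_entry(const void *a, const void *b)
    {
        uint16_t x = ((const hash_entry *)a)->value;
        uint16_t y = ((const hash_entry *)b)->value;
        return (x > y) - (x < y);
    }

    /* Sort the hash list once up front ...                            */
    void sort_hash_list(hash_entry *hl, size_t n)
    {
        qsort(hl, n, sizeof *hl, cmp_entry);
    }

    /* ... so each lookup is O(log n).  Returns NULL when X is absent. */
    hash_entry *find_hash(hash_entry *hl, size_t n, uint16_t X)
    {
        hash_entry key = { X, 0 };
        return bsearch(&key, hl, n, sizeof *hl, cmp_entry);
    }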
In accordance with an embodiment of the present invention, a sliding hash algorithm is defined. By way of example, the sliding hash algorithm may be implemented as a simple 16-bit checksum or as a stronger 16-bit shifting XOR that begins with a non-zero seed value.
In the simple 16-bit checksum example the hash value is computed by creating a 16-bit summation of the value of all bytes in the range of interest and discarding any overflow. Implementing the 16-bit checksum algorithm on a subset of a dataset, the subset being N bytes in length, the algorithm will have O(N) performance. By way of example, the source code for such a hash algorithm may be defined as follows:
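    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative sketch of the 16-bit checksum hash: sums the N bytes
     * beginning at P into a 16-bit value, discarding any overflow.      */
    uint16_t Hash(const uint8_t *P, size_t N)
    {
        uint16_t hash = 0;
        for (size_t i = 0; i < N; i++)
            hash = (uint16_t)(hash + P[i]);   /* overflow is discarded */
        return hash;
    }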
Once the hash of the subset has been calculated one time in this manner, the hash can be “slid” to either the left or the right with O(1) performance. In other words, the previously calculated hash value can be used to determine the new hash value much more expeditiously than calling Hash(P+1, N). For this example, the following function may be used to “slide” the previously calculated hash value to the right by a single byte:
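    /* Illustrative sketch of the slide function: given hash, the hash of
     * the N-byte segment beginning at P, returns the hash of the N-byte
     * segment beginning at P + 1 in O(1) time.                          */
    uint16_t Slide(uint16_t hash, const uint8_t *P, size_t N)
    {
        return (uint16_t)(hash - P[0] + P[N]);
    }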
For the 16-bit checksum hash algorithm, the slide function subtracts the first byte of the previously hashed segment and then adds the final byte of the new segment that is under consideration.
The 16-bit checksum algorithm described above is considered a weak hash algorithm, in that it commonly produces the same hash value for different inputs. This phenomenon is referred to in the art of computer programming as a hash collision. To lessen the likelihood of collisions occurring, a stronger hash algorithm may be implemented. For example, a 16-bit shifting XOR algorithm that begins with a non-zero seed value, such as 0x1357, will provide stronger protection against hash collisions. For each byte of input data, the current value is rotated to the left one bit (the highest order bit is rotated to the lowest order position and all other bits are shifted to the left one position) and the input byte is bitwise XORed into this value, for example: X = ROTL(X, 1) ^ *P++;
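Expressed as a complete function, a sketch of this hash may be written as follows, with ROTL denoting a 16-bit rotate left (a helper assumed for this sketch):

    #include <stddef.h>
    #include <stdint.h>

    /* 16-bit rotate left (helper assumed for this sketch). */
    static uint16_t ROTL(uint16_t x, unsigned r)
    {
        r &= 15;   /* rotate amount modulo 16 */
        return r ? (uint16_t)((x << r) | (x >> (16 - r))) : x;
    }

    /* Illustrative sketch of the 16-bit shifting XOR hash: starts from
     * the non-zero seed 0x1357, then for each input byte rotates the
     * current value left one bit and XORs the byte in.                 */
    uint16_t Hash16(const uint8_t *P, size_t N)
    {
        uint16_t X = 0x1357;
        while (N--)
            X = (uint16_t)(ROTL(X, 1) ^ *P++);
        return X;
    }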
The 16-bit shifting XOR hash algorithm can be slid to the right one byte with O(1) performance using a function of the following form:

    /* Illustrative sketch of the shifting XOR slide function: given X,
     * the hash of the N-byte segment beginning at P, returns the hash
     * of the N-byte segment beginning at P + 1.                        */
    uint16_t Slide16(uint16_t X, const uint8_t *P, size_t N)
    {
        return (uint16_t)(ROTL(X, 1)
                          ^ ROTL(0x1357, (unsigned)N + 1) /* remove the rotated seed   */
                          ^ ROTL(0x1357, (unsigned)N)     /* re-insert seed, realigned */
                          ^ ROTL(P[0], (unsigned)N)       /* remove the first byte     */
                          ^ P[N]);                        /* add the new final byte    */
    }
Similar to the 16-bit checksum example, the 16-bit shifting XOR slide function removes the effects of the first byte of the previously hashed segment and adds the effects of the final byte of the new segment under consideration. However, for the 16-bit shifting XOR algorithm it is also necessary to take into account the amount that the first byte, the final byte and the seed value have been shifted into the result. The 16-bit shifting XOR algorithm is used to generate the example values shown in
Referring again to
The validation process discussed above ensues, at step 542, where N′ is assigned to the hash list index of the hash list entry that matches the hash value of the suspect dataset. At step 544, a determination is made as to whether the hash value of the next S-sized suspect dataset segment, Hash16(P+S, S), is equal to the hash value of the next S-sized source dataset segment, HL16(N′+1). If a determination is made that the next segments do not have equivalent hash values, the routine returns to step 536 for a determination of the next match between the source dataset hash list entries and the hash value, X, of the suspect dataset. If no further match is found, the flow returns to steps 538 and 540 for further slide function processing; if a match is found, the flow returns to steps 542 and 544 for further validation of the match via next-segment matching.
If a determination is made that the next segments of the suspect and source datasets do have equivalent hash values then, at step 546, the matching segments, i.e., the two consecutive segments, are copied from the suspect dataset to the target dataset. At steps 548 and 550, the next segment is considered for matching by setting the N index in the source dataset hash list to N′+2 and setting the pointer in the suspect dataset to P+(2*S). At step 552, the N′th and the (N′+1)th hash list entries are marked as being matched and the flow returns to step 524, where a new hash value is determined for Hash16(P, S).
It should be noted that for the sliding comparison flow illustrated in
Referring again to
At step 418, the first 16-bit portion of each entry in the HL128 hash list is transmitted from the target host to the source host. Once the source host receives these 16-bit portions, at step 420, the source host calculates a 128-bit hash value for each segment of the source dataset that has not been requested by the target host. These hash values are then concatenated into a linear hash list, HL128. At step 422, the source host compares the first 16-bit portion of each entry in the source dataset HL128 with the first 16-bit portion of the target dataset HL128 to determine if matches exist. Any source dataset segments whose corresponding HL128 entry did not match are transmitted, at step 424, from the source host to the target host.
Once the target host receives the source dataset segments, at step 426, the segments are copied into their respective positions within the target dataset. At step 428, a 128-bit hash value is created for the entire target dataset and this hash value is compared to the hash value for the entire source dataset, Z. If it is determined that the hash values of the entire source and target datasets are equivalent then the transfer and reconciliation are deemed to have been successful and the process is completed. If, however, it is determined that the hash value of the entire target dataset and Z, the hash value of the entire source dataset, are not equal then, at step 430, a determination is made as to whether all 16-bit portions of the target dataset hash list have been transmitted to the source host. If it is determined that all 16-bit portions of the target dataset hash list have been sent and the dataset hash values are not equal, the transfer and reconciliation process is ended unsuccessfully. If it is determined that not all of the 16-bit portions of the target dataset hash list have been sent then, at step 432, the target host transmits to the source host the next 16 bits of each entry in the target dataset HL128 hash list.
Once the source host receives the next 16 bits of each entry in the target dataset HL128, at step 422, the source host compares the next 16 bits of each entry in the target dataset HL128 with the next 16 bits in the source dataset HL128 to determine if matches exist. Any source dataset segments whose corresponding HL128 entry did not match are transmitted, at step 424, from the source host to the target host. This process continues until the hash values of the entire source dataset and target dataset are equal, representing successful transfer and reconciliation, or until all of the 16-bit portions of the target dataset HL128 have been transmitted and the hash values of the entire source and target datasets have been determined not to be equal, thus signifying an unsuccessful transfer and reconciliation process.
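By way of a sketch, transmitting the hash list sixteen bits at a time amounts to slicing each 128-bit entry into eight 16-bit words and sending word k on round k; the layout and helper below are illustrative assumptions:

    #include <stddef.h>
    #include <stdint.h>

    /* A 128-bit hash list entry viewed as eight 16-bit words. */
    typedef struct { uint16_t word[8]; } hl128_entry;

    /* Copy the k-th 16-bit portion (0 <= k < 8) of every HL128 entry
     * into a compact buffer for transmission.                        */
    void slice_hash_list(const hl128_entry *hl, size_t entries,
                         unsigned k, uint16_t *out)
    {
        for (size_t n = 0; n < entries; n++)
            out[n] = hl[n].word[k];
    }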
Therefore, the present invention provides for an improved method and system for expedited data transfer and data reconciliation. The method and systems of the present invention can effectively and efficiently perform data transfer and data reconciliation between data files existing on separate hosts. The resulting efficient transfer of data significantly reduces the time required to transfer updates and limits the transfer resources required to perform the transfer operation. The method and system is capable of effectively isolating the data that has been revised, updated, added or deleted in order to limit the data that is transferred from the source host to the target host. In addition, the system provides for data reconciliation in those applications in which neither host is aware of the revision that exists on the other host.
Many modifications and other embodiments of the invention will come to mind to one skilled in the art to which this invention pertains having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limiting the scope of the present invention in any way.