Claims
- 1. A semantic hashing method in a file system, the method comprising:
determining first semantic information for a first file; selecting a base file using the first semantic information and base semantic information, wherein the base semantic information is semantic information for the base file; and computing a diff between the first file and the base file.
- 2. The method of claim 1, further comprising storing the diff, wherein the diff is used to generate the first file at a later time.
- 3. The method of claim 2, further comprising generating the first file from the stored diff in response to receiving a read request for the first file.
- 4. The method of claim 1, wherein the step of determining first semantic information further comprises:
extracting semantic information from the first file, the extracted semantic information including predetermined features of the first file.
- 5. The method of claim 1, wherein the step of selecting a base file comprises steps of:
identifying multiple files in the file system having semantic information similar to the first semantic information; computing a diff for each of the multiple files using the first file; and selecting a file of the multiple files having a smallest diff.
- 6. The method of claim 5, wherein the step of selecting a file comprises steps of:
comparing the diff of the selected file to a threshold; selecting the file to be the base file in response to the diff being less than the threshold; and identifying another set of multiple files from the file system for selecting a base file in response to the diff being greater than the threshold.
- 7. The method of claim 1, wherein the step of selecting a base file comprises steps of:
identifying multiple files in the file system having semantic information similar to the first semantic information; performing block-level hashing for each of the multiple files and the first file; and selecting a file of the multiple files having the greatest number of blocks similar to blocks of the first file.
- 8. The method of claim 1, wherein the step of computing the diff comprises steps of:
selecting a diff function associated with the type of the first file; and computing the diff using the selected diff function.
- 9. An apparatus in a file system comprising:
means for determining first semantic information for a first file; means for selecting a base file using the first semantic information and base semantic information, wherein the base semantic information is semantic information for the base file; and means for computing a diff between the first file and the base file.
- 10. The apparatus of claim 9, further comprising means for storing the diff, wherein the diff is used to generate the first file at a later time.
- 11. The apparatus of claim 10, further comprising means for generating the first file from the stored diff in response to receiving a read request for the first file.
- 12. The apparatus of claim 9, wherein the means for determining first semantic information further comprises means for extracting semantic information from the first file, the extracted semantic information including predetermined features of the first file.
- 13. The apparatus of claim 9, wherein the means for selecting a base file comprises:
means for identifying multiple files in the file system having semantic information similar to the first semantic information; means for computing a diff for each of the multiple files using the first file; and means for selecting a file of the multiple files having a smallest diff.
- 14. The apparatus of claim 13, wherein the means for selecting a file comprises:
means for comparing the diff of the selected file to a threshold; means for selecting the file to be the base file in response to the diff being less than the threshold; and means for identifying another set of multiple files from the file system for selecting a base file in response to the diff being greater than the threshold.
- 15. The apparatus of claim 9, wherein the means for selecting a base file comprises:
means for identifying multiple files in the file system having semantic information similar to the first semantic information; means for performing block-level hashing for each of the multiple files and the first file; and means for selecting a file of the multiple files having the greatest number of blocks similar to blocks of the first file.
- 16. The apparatus of claim 9, wherein the means for computing the diff comprises:
means for selecting a diff function associated with the type of the first file; and means for computing the diff using the selected diff function.
- 17. A distributed file system comprising:
a plurality of nodes storing objects, wherein at least one of the objects is a version of a base object; one of the plurality of nodes being operable to store a diff for the version and the base object in one of the plurality of nodes, wherein the base object is semantically close to the version; at least one extractor operable to extract semantic information for one or more of the objects; and a semantic catalogue stored in the file system, the semantic catalogue comprising semantic information for the objects.
- 18. The distributed file system of claim 17, wherein the distributed file system is operable to search the semantic information in the semantic catalogue to identify the base object.
- 19. The distributed file system of claim 18, wherein the semantic information is semantic vectors for the objects, wherein each semantic vector identifies predetermined features for an associated object.
- 20. The distributed file system of claim 19, wherein the semantically close base object has a semantic vector similar to a semantic vector for the version.
- 21. The distributed file system of claim 17, wherein the distributed file system is overlaid on a peer-to-peer network comprising the plurality of nodes.
- 22. The distributed file system of claim 21, further comprising a distributed archive file system operable to store a plurality of versions of the objects.
- 23. The distributed file system of claim 17, wherein the semantic catalogue is a distributed index stored on the plurality of nodes.
- 24. The distributed file system of claim 17, wherein the diff is data associated with differences between the base object and the version.
- 25. A node in a semantic-based distributed file system, the node comprising:
a processor; at least one storage device storing objects; and a semantic catalogue containing semantic information for the objects, wherein the processor is operable to compute a diff between a base object in the file system and a new version of the base object for storage in the file system, the base object being semantically close to the new version.
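By way of illustration only, the steps recited in claims 1 to 5 might be sketched in Python as follows. Every name here (`semantic_vector`, `cosine`, `compute_diff`, `apply_diff`, `select_base`, the parameter `k`) is hypothetical and does not come from the claims. The sketch assumes that word-frequency counts stand in for the "predetermined features", that cosine similarity measures semantic closeness, and that a line-based copy/insert delta serves as the diff. Other extractors, similarity measures, and diff functions could be substituted.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def semantic_vector(text):
    """Assumed extractor: word counts stand in for predetermined features."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compute_diff(base, new):
    """Build a delta of copy/insert operations that turns base into new."""
    base_lines = base.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    matcher = SequenceMatcher(None, base_lines, new_lines, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))             # reuse lines of the base
        elif tag in ("replace", "insert"):
            ops.append(("insert", new_lines[j1:j2]))  # carry new lines verbatim
        # "delete" emits nothing: the base lines are simply not copied
    return ops


def diff_size(ops):
    """Characters the delta must store (copy operations are treated as free)."""
    return sum(len(line) for op in ops if op[0] == "insert" for line in op[1])


def apply_diff(base, ops):
    """Regenerate the file from the base file and a stored delta."""
    base_lines = base.splitlines(keepends=True)
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(base_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return "".join(out)


def select_base(new_text, catalogue, files, k=3):
    """Pick, among the k semantically closest files, the one with the smallest diff.

    catalogue maps file name -> semantic vector; files maps file name -> content.
    Returns (base_name, delta), or (None, None) when the catalogue is empty.
    """
    vec = semantic_vector(new_text)
    ranked = sorted(catalogue, key=lambda name: cosine(vec, catalogue[name]),
                    reverse=True)
    deltas = {name: compute_diff(files[name], new_text) for name in ranked[:k]}
    if not deltas:
        return None, None
    best = min(deltas, key=lambda name: diff_size(deltas[name]))
    return best, deltas[best]
```

In this sketch only the returned delta and the name of the base file would need to be stored. A later read request could then be served by calling `apply_diff` on the base file's content and the stored delta.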
CROSS-REFERENCE
[0001] The present invention is related to pending:
[0002] U.S. application Ser. No. ______, (Attorney Docket No. 200207181-1) filed herewith, and entitled “SEMANTIC FILE SYSTEM”, by Xu et al.; and
[0003] U.S. application Ser. No. ______, (Attorney Docket No. 200207183-1) filed herewith, and entitled “SNAPSHOT OF A FILE SYSTEM” by Mahalingam et al.; which are all assigned to the assignee and are incorporated by reference herein in their entirety.