The present invention relates to the field of management of image data in data storage. In particular, the present invention relates to a method and a corresponding device for automatic detection of duplicate images in data storage, which method and device are particularly efficient with regard to the automatic management of large amounts of dispersed image data.
The proliferation of digital devices that comprise a photo camera has favored an explosion of the volume of image data stored by a user, and it is quite easy for a user to end up with many image duplicates in the user's image library.
This situation can be even worse in the case of a home network environment, where several users can add images to an image library, the library possibly being physically distributed over several dispersed storage devices, for example hard drives of different PCs, a NAS (Network Attached Storage), USB keys, etc.
There are various reasons why an image library can end up containing many duplicate images. Unintentional duplicate images are produced through copy actions. For example, a user who organizes photos in different directories unintentionally copies the photos instead of moving them, which would have been appropriate; a user who wishes to send photos by e-mail reduces the photo resolution to include them in the e-mail but unintentionally keeps the low-resolution copies; a user who views images with a viewer application modifies them by rotation or by changing color and contrast and unintentionally keeps the unmodified copy in addition to the modified copy. Other copy actions are intentional and are due to the fact that the user no longer has an overview of the data that he has stored, a situation that gets worse when the user has multiple storage devices and many images, and worse still when multiple users add and copy data to the multitude of images stored. The user, knowing that he does not have a clear overview of the images stored, worsens this situation by preferring to copy rather than to move or replace images, for fear of deleting them. This creates a situation where the user no longer knows which images are disposable copies and which are not.
In all these scenarios, a duplicate detection tool can be necessary, or at least useful, to assist the user with the cleanup or management tasks of the user's image library.
Prior-art detection of image duplicates relies on criteria such as checksum data, creation date, file name, file size, and image format. Such criteria allow only detection of identical copies of an original image, but not of copies that have been slightly or largely modified in order to enhance the visual perception of the image on a particular display. Moreover, if more than one criterion for the duplicate detection is specified, duplicates are detected that comply with any of the selected criteria, and user intervention is needed to determine whether the user wishes to delete the detected duplicates from the image library. Other duplicate detection methods are capable of detecting near-duplicate images by comparing image pixel data. A user is required to specify the percentage of pixel data of two images that must match for an image to be marked and detected as a duplicate image. The detection then reports these near-duplicates in the same way as strictly identical images, without distinction.
Thus, prior-art solutions can still be optimized with regard to detection of duplicate images in data storage. Notably, a method is needed that reduces user intervention to the strict minimum.
The invention reduces the complexity of maintaining a collection of images.
When near duplicates are found according to the method of the invention, associated metadata summarizes the reason why they are considered duplicates (exact, near, far), for example low-resolution copy, zoom, copy stored in a backup zone, etc. A set of management action rules associated with this metadata then allows the image library to be managed automatically. Automatic management actions are taken for images which are considered as corresponding by the method of the invention, based on the discussed metadata and associated rules. These management actions are for example delete, keep, or replace by a link to the original image. The latter option can be required when the existence of identical copies is to be avoided for reasons of efficiency.
The discussed advantages and other advantages not mentioned here, that make the device and method of the invention advantageously well suited for automatic management of a collection of images, will become clear through the detailed description of the invention that follows.
In order to automatically manage a collection of images, the invention proposes a method comprising a step of detection of correspondence between a first image and at least one second image in the collection of images according to at least one criterion for correspondence between the first image and the at least one second image, and a step of association of metadata with the at least one second image when the correspondence is detected, the metadata being representative of a relation between the first image and the at least one second image and comprising the at least one criterion for correspondence between the first image and the at least one second image which has led to the detection of correspondence.
According to a variant embodiment of the invention, the method further comprises a step of automatic determination and application of one of a set of predetermined actions for processing of the at least one second image according to the associated metadata.
According to a variant embodiment of the invention, the metadata further comprises information representative of a level of correspondence between the first image and the at least one second image.
According to a variant embodiment of the invention, the set of predetermined actions comprises an action of replacement of the at least one second image with a link to the first image.
According to a variant embodiment of the invention, the set of predetermined actions comprises an action of deletion of the at least one second image.
According to a variant embodiment of the invention, the set of predetermined actions comprises an action of transferring the at least one second image to a storage.
According to a variant embodiment of the invention, the set of predetermined actions comprises an action of renaming the at least one second image.
The invention also concerns a device for automatic management of a collection of images, the device comprising means for detection of correspondence between a first image and at least one second image in the collection of images according to at least one criterion for correspondence between the first image and the at least one second image, and means for association of metadata with the at least one second image when the correspondence is detected, the metadata being representative of a relation between the first image and the at least one second image and comprising the at least one criterion for correspondence between the first image and the at least one second image which has led to the detection of correspondence.
More advantages of the invention will appear through the description of particular, non-restricting embodiments of the invention. The embodiments will be described with reference to the following figures:
In a first initialization step 100, variables are initialized for the functioning of the method. When the method is implemented in a device such as device 400 of
According to the described embodiment, the images in the data storage are first completely processed by the detection method, before being processed by the automatic determination and application of predetermined actions. According to a variant embodiment, the latter automatic determination is done immediately following the discussed association of metadata. This variant has an advantage in processing time because the mass of images to be processed by the detection method can be reduced: each time images are deleted by delete actions, the mass of images left to process shrinks. An intelligent selection method for first (“a”) and second (“b”) images can further reduce the processing time needed. For example, a step is added to the method that excludes detection of correspondence between two images that have already passed through the detection process. According to this variant, the detection method associates metadata with each image that has been completely processed by the detection process, which metadata indicates that the image has already been processed as a first (“a”) image. In each next iteration of the detection method, where a next first (“a”) image is selected, already processed first (“a”) images are not processed again, i.e. they are not selected as second (“b”) images.
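The selection variant described above can be sketched as follows. This is a minimal illustration only: the function name `candidate_pairs` is hypothetical, and an in-memory set stands in for the "already processed" metadata that the description associates with each image.

```python
def candidate_pairs(images, processed=None):
    """Yield (a, b) pairs for the detection step.

    An image that has already been fully processed as a first ("a") image
    is never selected again, neither as "a" nor as a second ("b") image.
    The `processed` set is an assumption standing in for per-image metadata.
    """
    processed = set() if processed is None else processed
    for a in images:
        if a in processed:
            continue
        for b in images:
            if b == a or b in processed:
                continue
            yield a, b
        # All "b" candidates for this "a" have been produced: mark it done.
        processed.add(a)


pairs = list(candidate_pairs(["x", "y", "z"]))
```

With three images, this produces three comparisons instead of the six that an exhaustive ordered pairing would require.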
In a first test step 200, it is determined if a checksum calculated over the first (“a”) image is the same as a checksum calculated over the second (“b”) image. Checksum calculation is done through known methods, such as SHA (Secure Hash Algorithm) or MD5 (Message Digest 5). If the calculated checksum is the same, the two images are considered as being identical and a decisional step 201 is done, in which it is determined if the location where the second (“b”) image is stored is a location for storage of backup. If so, metadata is added in step 203 to the identical second (“b”) image that indicates that the second image is a backup copy of the first image. If not, metadata is added in step 202 to the identical second image that indicates that the second image is an identical copy. As will be discussed further on, it is possible to automatically delete identical images that are not backup copies by execution of actions associated with metadata. If, on the contrary, the outcome of test step 200 is that the checksums of the first and the second images are different, a test step 204 is executed, in which it is determined if a normalized distance d between the fingerprints fp(a) of the first (“a”) image and fp(b) of the second (“b”) image is below a threshold th2, i.e. d(fp(a),fp(b)) < th2; th2 is chosen such that if d(fp(a),fp(b)) < th2, the second image “b” can be considered as being a modified copy of the first image “a”. If d(fp(a),fp(b)) is not less than th2, the first and the second images are considered as being different by the method of the invention and the method continues with step 109. But if d(fp(a),fp(b)) is less than th2, we are dealing with a modified copy, and the following steps determine how the difference between the two images can be characterized. Notably, in a next step 205, the previously calculated normalized fingerprint distance is compared with a next threshold th1.
If d(fp(a),fp(b)) is greater than th1, the second image “b” is characterized in a step 206 as being a largely modified copy of the first image “a” and corresponding metadata is associated with the second image, for example according to table 1, first row (LMC, <path>/a). If on the contrary d(fp(a),fp(b)) is less than th1, a test step 207 is executed, in which it is verified if the first image (“a”) has the same resolution as the second image “b”. Image resolution can be compared based on prior-art file metadata that is present in prior-art file systems, such as EXIF (Exchangeable Image File Format). If the image resolutions differ, a step 208 is executed in which metadata is associated with the second image that indicates that the second image is a different resolution copy of the first image; e.g. a tag ‘DRC’ is added to metadata associated with image b together with the storage path of image a: (DRC, <path>/a). If on the contrary the resolution of the first image is the same as that of the second image, a next test step 209 is executed, in which the encoding methods of the two images are compared. This comparison is done according to known methods, for example by comparing file extensions (e.g. *.jpg, *.tiff). If the two images are encoded with different encoding methods, a step 210 is executed in which corresponding metadata is associated with the second image, e.g. a tag ‘DEC’ is added to image b together with the storage path of image a: (DEC, <path>/a). If on the contrary the two images are encoded with the same encoding method, step 211 is executed in which metadata (SMC, <path>/a) is associated with the second image. After steps 202, 203, 206, 208, 210 and 211, step 109 is executed, returning to
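The decision sequence of steps 200 to 211 can be sketched as a single function. This is an illustrative sketch under several assumptions: the `Image` record, the `classify` function, the caller-supplied `distance` function and the tag names 'IDC' and 'BCK' for identical and backup copies are hypothetical (the description names only LMC, DRC, DEC and SMC), SHA-256 is chosen as one member of the SHA family, and the case d = th1, which the description leaves open, is treated here like d < th1.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Image:
    path: str
    data: bytes            # raw file content
    resolution: tuple      # (width, height), e.g. read from EXIF
    encoding: str          # e.g. "jpg", derived from the file extension
    in_backup: bool = False


def classify(a, b, distance, th1, th2):
    """Return (tag, path of a) to associate with b, or None if a and b differ."""
    # Step 200: identical checksums mean identical files.
    if hashlib.sha256(a.data).digest() == hashlib.sha256(b.data).digest():
        # Steps 201-203: backup copy or plain identical copy (hypothetical tags).
        return ("BCK" if b.in_backup else "IDC", a.path)
    d = distance(a, b)
    if not d < th2:                        # step 204: not a modified copy
        return None
    if d > th1:                            # steps 205-206: largely modified copy
        return ("LMC", a.path)
    if a.resolution != b.resolution:       # steps 207-208
        return ("DRC", a.path)
    if a.encoding != b.encoding:           # steps 209-210
        return ("DEC", a.path)
    return ("SMC", a.path)                 # step 211


a = Image("/lib/a.jpg", b"orig", (4000, 3000), "jpg")
backup = Image("/bak/a.jpg", b"orig", (4000, 3000), "jpg", in_backup=True)
small = Image("/lib/a_small.jpg", b"small", (800, 600), "jpg")
tag_for_small = classify(a, small, lambda x, y: 0.1, th1=0.2, th2=0.5)
```

The threshold values 0.2 and 0.5 are placeholders chosen for the example, not values taken from the description.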
In a first step 300, a next second image is chosen (“b” image). Its associated metadata is read in step 301 and in a step 302 an action is determined for the associated metadata, for example, according to the actions as defined in table 3. In a test 303, it is determined if the action associated with the metadata is the creation of a file link. If so, a file link is created in a step 306, from the second image to the first image. The metadata remains associated with the link, so that a trace is kept for future iterations of the method of the invention. If the action is not a create link action, it is verified in a test 304 if the action is a delete image action; if so, the second image is deleted in a step 307. If the action is not a delete image action either, it is verified in a test 305 if the action is an ask action, and if so, the second image is transferred in a step 308 to a temporary storage, where images are stored for which a user decision is needed. If not, the action steps are repeated with a selection of a next second image in step 300. This is also the case after steps 306, 307 and 308. The processing ends when all images have been processed.
Variant embodiments of the discussed application of actions are possible. Notably, the steps of the method can be executed in a different order, and actions (and thus tests) can be added or removed.
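The tests 303 to 305 and the steps 306 to 308 amount to a dispatch on the action name. The sketch below assumes a POSIX-style file system and uses a symbolic link as one possible form of file link; the function names, the action strings "link", "delete" and "ask", and the `ask_dir` temporary storage directory are hypothetical.

```python
import os
import shutil


def apply_action(action, b_path, a_path, ask_dir):
    """Apply one action to the second image b, whose original is a."""
    if action == "link":                   # test 303 / step 306
        os.remove(b_path)
        os.symlink(a_path, b_path)         # b now points to a
    elif action == "delete":               # test 304 / step 307
        os.remove(b_path)
    elif action == "ask":                  # test 305 / step 308
        shutil.move(b_path, os.path.join(ask_dir, os.path.basename(b_path)))
    # Any other action: the image is left untouched.


def apply_all(entries, ask_dir):
    """Steps 300-308 for every (b_path, a_path, action) entry."""
    for b_path, a_path, action in entries:
        apply_action(action, b_path, a_path, ask_dir)
```

Keeping the metadata attached to the created link, as the description requires, depends on where the metadata is stored and is not shown here.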
The method of the invention can be applied as a background task or as a clean-up tool that is more or less regularly executed. The method can be enhanced with a monitoring feature that monitors creation, deletion and copying of images so as to keep the metadata updated as soon as a creation, deletion or copying is executed.
Table 2 hereunder illustrates an example lookup table for looking up actions that are associated with a tag type. The tag types are those of the example implementation illustrated by means of
When the second image has an associated metadata tag ‘DEC’, the associated action is to delete the second image only if the first image is of the ‘png’ encoding type. When the second image has an associated tag ‘SMC’, the associated action is to ask the user to decide what to do. According to a variant embodiment of the invention, images with associated action ‘Ask’ are grouped in temporary storage and the user is only bothered once, for a review of all images in this temporary storage for which the user's decision is required. Such a review can for example be done through a visual presentation of each pair of corresponding first and second images, with a possibility of un-checking a ‘keep’ checkbox related to the second image of each pair.
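A lookup of this kind can be sketched as a dictionary of rules. Only the two rules stated above are included; the names `ACTION_RULES` and `lookup_action` are hypothetical, and falling back to "keep" when the 'DEC' condition is not met or when a tag has no rule is an assumption, since the description does not say what happens in those cases.

```python
# Each rule maps the encoding type of the first image to an action name.
ACTION_RULES = {
    # 'DEC': delete the second image only if the first image is a png.
    "DEC": lambda first_encoding: "delete" if first_encoding == "png" else "keep",
    # 'SMC': always ask the user.
    "SMC": lambda first_encoding: "ask",
}


def lookup_action(tag, first_encoding, default="keep"):
    """Return the action for a tag; unknown tags fall back to `default`."""
    rule = ACTION_RULES.get(tag)
    return rule(first_encoding) if rule is not None else default
```

Because the rules live in a plain dictionary, making the actions user-configurable reduces to replacing or extending its entries.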
According to a variant embodiment of the invention, multiple metadata tags can be associated with a single image. For example, the same image can have both DRC and DEC tags, meaning that the image is a different resolution copy but also a different encoding copy. In this case, the steps of the method are not executed as depicted in
According to a variant embodiment of the invention, the actions are user-configurable.
Where ∥.∥ represents the L2 norm of a vector, i.e. its Euclidean length.
An image fingerprint, constructed according to known prior-art methods, can be represented as an n-dimensional vector. “n” can have a value of a hundred or even a thousand. In our example and for simplicity of illustration, we assume that n=2. The center of
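For illustration, a normalized distance built on the L2 norm can be sketched as below. The normalization used here, dividing the norm of the difference by the sum of the two norms, is an assumption chosen because it keeps d between 0 and 1; it is not necessarily the formula of the described embodiment, and the function names are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean length of a vector given as a sequence of numbers."""
    return math.sqrt(sum(x * x for x in v))


def normalized_distance(fp_a, fp_b):
    """Assumed normalization: ||fp_a - fp_b|| / (||fp_a|| + ||fp_b||)."""
    denominator = l2_norm(fp_a) + l2_norm(fp_b)
    if denominator == 0:
        return 0.0                         # two all-zero fingerprints
    difference = [x - y for x, y in zip(fp_a, fp_b)]
    return l2_norm(difference) / denominator
```

Under this assumption, identical fingerprints give d = 0 and opposite fingerprints give d = 1, so thresholds th1 < th2 can be chosen inside that interval.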
It is noted that the word “register” used in the description of memories 510 and 520 designates in each of the mentioned memories, a low-capacity memory zone capable of storing some binary data, as well as a high-capacity memory zone, capable of storing an executable program, or a whole data set.
Processing unit 511 can be implemented as a microprocessor, a custom chip, a dedicated (micro-) controller, and so on. Non-volatile memory NVM 510 can be implemented in any form of non-volatile memory, such as a hard disk, non-volatile random-access memory, EPROM (Erasable Programmable ROM), and so on.
The non-volatile memory NVM 510 comprises notably a register 5101 that holds an executable program implementing the method according to the invention. When powered up, the processing unit 511 loads the instructions comprised in NVM register 5101, copies them to VM register 5201, and executes them.
The VM memory 520 comprises notably:
A device such as device 500 is suited for implementing the method of the invention of automatic management of a collection of images, the device comprising
means for detection (CPU 511, VM register 5205) of correspondence between a first image and at least one second image in said collection of images according to at least one criterion for correspondence between said first image and the at least one second image;
means for association of metadata (CPU 511, register 5206) with the at least one second image when said correspondence is detected, the metadata being representative of a relation between the first image and the at least one second image and comprising the at least one criterion for correspondence between the first image and the at least one second image which has led to the detection of the correspondence.
Other device architectures than illustrated by
| Number | Date | Country | Kind |
|---|---|---|---|
| 11306284.8 | Oct 2011 | EP | regional |

| Filing Document | Filing Date | Country | Kind | 371c Date |
|---|---|---|---|---|
| PCT/EP2012/068909 | 9/26/2012 | WO | 00 | 4/2/2014 |