The present disclosure generally relates to digital pathology and more specifically to the deduplication of sample images.
In digital pathology, whole slide images of tissue samples can be used for computational processes. In many instances, sets of the whole slide images contain either identical or similar copies of digital or physical samples. Such copies are generally removed before the set of whole slide images is used during the computational processes. However, removing the copies can be cumbersome, inefficient, and take a long time. Further complicating the removal of the copies, it can be difficult to identify which images from the set of images are copies. For example, many images may be associated with mismatched information, have incorrect labels, and/or have missing information. Many images may also be similar to one another, but not identical. Thus, it can be difficult to easily and quickly identify the copies so that they can be removed for use in the computational processes.
Methods, systems, and articles of manufacture, including computer program products, are provided for deduplication of sample images. In one aspect, there is provided a system. The system may include at least one processor and at least one memory. The at least one memory may store instructions that result in operations when executed by the at least one processor. The operations may include: identifying, based at least on an identification information associated with a plurality of images depicting one or more slices of tissue from at least one block of a tissue sample, one or more of a duplicate image, a replicate image, and a multiple image present within the plurality of images; identifying, based at least on a metric computed for one or more pairs of images from the plurality of images, the one or more of the duplicate image, the replicate image, and the multiple image present within the plurality of images, the metric indicative of a minimal difference achieved between each pair of images while adjusting an alignment therebetween; and updating the identification information associated with the plurality of images to indicate the one or more of the duplicate image, the replicate image, and the multiple image identified as present within the plurality of images.
In another aspect, there is provided a method for identifying duplicate images. The method may include: identifying, based at least on an identification information associated with a plurality of images depicting one or more slices of tissue from at least one block of a tissue sample, one or more of a duplicate image, a replicate image, and a multiple image present within the plurality of images; identifying, based at least on a metric computed for one or more pairs of images from the plurality of images, the one or more of the duplicate image, the replicate image, and the multiple image present within the plurality of images, the metric indicative of a minimal difference achieved between each pair of images while adjusting an alignment therebetween; and updating the identification information associated with the plurality of images to indicate the one or more of the duplicate image, the replicate image, and the multiple image identified as present within the plurality of images.
In another aspect, there is provided a computer program product that includes a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium may include program code that causes operations when executed by at least one processor. The operations may include: identifying, based at least on an identification information associated with a plurality of images depicting one or more slices of tissue from at least one block of a tissue sample, one or more of a duplicate image, a replicate image, and a multiple image present within the plurality of images; identifying, based at least on a metric computed for one or more pairs of images from the plurality of images, the one or more of the duplicate image, the replicate image, and the multiple image present within the plurality of images, the metric indicative of a minimal difference achieved between each pair of images while adjusting an alignment therebetween; and updating the identification information associated with the plurality of images to indicate the one or more of the duplicate image, the replicate image, and the multiple image identified as present within the plurality of images.
In another aspect, there is provided an apparatus that includes: means for identifying, based at least on an identification information associated with a plurality of images depicting one or more slices of tissue from at least one block of a tissue sample, one or more of a duplicate image, a replicate image, and a multiple image present within the plurality of images; means for identifying, based at least on a metric computed for one or more pairs of images from the plurality of images, the one or more of the duplicate image, the replicate image, and the multiple image present within the plurality of images, the metric indicative of a minimal difference achieved between each pair of images while adjusting an alignment therebetween; and means for updating the identification information associated with the plurality of images to indicate the one or more of the duplicate image, the replicate image, and the multiple image identified as present within the plurality of images.
In some variations of the methods, systems, and non-transitory computer readable media, one or more of the following features can optionally be included in any feasible combination.
In some variations, the identification information may include at least a portion of a manifest, a scannable barcode, and/or metadata associated with the plurality of images.
In some variations, the identification information may include one or more of a sample identifier, a patient identifier, a block identifier, a slide identifier, an imaging modality, and a scanning parameter.
In some variations, a first image of the plurality of images may be identified as a duplicate of a second image of the plurality of images based at least on the first image and the second image being associated with matching sample identifiers, patient identifiers, block identifiers, and slide identifiers.
In some variations, a first image of the plurality of images may be identified as a replicate of a second image of the plurality of images based at least on the first image and the second image being associated with matching sample identifiers, patient identifiers, and block identifiers but different slide identifiers.
In some variations, a first image of the plurality of images may be identified as a multiple of a second image of the plurality of images based at least on the first image and the second image being associated with matching sample identifiers and patient identifiers but different block identifiers and different slide identifiers.
In some variations, the identification information may be updated by at least including, in the identification information, a flag indicating the one or more of the duplicate image, the replicate image, and the multiple image present within the plurality of images.
In some variations, the minimal difference between each pair of images may correspond to a minimal difference between a first vector field of a first image included in each pair of images and a second vector field of a second image included in each pair of images.
In some variations, the first vector field and the second vector field may be generated. The difference between the first vector field and the second vector field may be determined while adjusting the alignment between the first image and the second image until achieving the minimal difference between the first vector field and the second vector field.
In some variations, each of the first vector field and the second vector field may include a plurality of vectors. Each vector of the plurality of vectors may correspond to a section within a corresponding image. Each vector of the plurality of vectors may have a direction and a magnitude corresponding to a change in intensity values of one or more pixels included in a corresponding section of the corresponding image.
In some variations, each of the first image and the second image may be divided into a plurality of sections. A size of each section of the plurality of sections and/or a quantity of the plurality of sections may be adjusted until the minimal difference between the first vector field and the second vector field is achieved.
In some variations, the alignment between the first image and the second image may be adjusted by at least translating, rotating, and/or inverting the first image relative to the second image.
In some variations, the difference between the first vector field and the second vector field may be determined by at least subtracting one or more x-components of the first vector field and the second vector field separately from one or more y-components of the first vector field and the second vector field.
In some variations, a first image in each pair of images may be identified as a duplicate of a second image in each pair of images based at least on the metric satisfying a first threshold.
In some variations, the first image may be identified as a replicate of the second image based at least on the metric failing to satisfy the first threshold but satisfying a second threshold greater than the first threshold.
In some variations, the first image and the second image may be identified as adjacent images from a same block of the tissue sample based at least on the metric satisfying the second threshold. The identification information associated with the plurality of images may be updated to indicate the first image and the second image as adjacent images from the same block of the tissue sample.
In some variations, the first image and the second image may be converted into grayscale images prior to generating the first vector field and the second vector field.
In some variations, a scale of pixel intensity values in each of the first image and the second image may be converted from a first scale to a second scale prior to generating the first vector field and the second vector field.
In some variations, the first scale may be [0, 1] and the second scale may be [−1, 1].
In some variations, the one or more of the duplicate image, the replicate image, and the multiple image present within the plurality of images may be first identified based on the identification information before the metric is computed for a remaining plurality of images to identify the one or more of the duplicate image, the replicate image, and the multiple image in the remaining plurality of images.
In some variations, at least one duplicate image present within the plurality of images may be deleted.
In some variations, the identification information may be updated to correct at least one discrepancy between the identification information and the one or more of the duplicate image, the replicate image, and the multiple image identified based on the metric.
In some variations, a sequence of the plurality of images indicated in the identification information may be updated to restore, based at least on an ordering of at least one replicate image identified within the plurality of images, an original sequence of images within the at least one block of the tissue sample.
In some variations, a sequence of the plurality of images indicated in the identification information may be updated to restore, based at least on an ordering of at least one multiple image identified within the plurality of images, an original sequence of images across different blocks of the tissue sample.
In some variations, the duplicate image may depict a same slice of tissue as another image included in the plurality of images. The replicate image may depict a different slice of tissue from a same block of the tissue sample as the other image included in the plurality of images. The multiple image may depict a slice of tissue from a different block of the tissue sample than the other image from the plurality of images.
Implementations of the current subject matter can include methods consistent with the descriptions provided herein as well as articles that comprise a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to result in operations implementing one or more of the described features. Similarly, computer systems are also described that may include one or more processors and one or more memories coupled to the one or more processors. A memory, which can include a non-transitory computer-readable or machine-readable storage medium, may include, encode, store, or the like one or more programs that cause one or more processors to perform one or more of the operations described herein. Computer implemented methods consistent with one or more implementations of the current subject matter can be implemented by one or more processors residing in a single computing system or multiple computing systems. Such multiple computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including, for example, a connection over a network (e.g., the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), a direct connection between one or more of the multiple computing systems, etc.
The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims. While certain features of the currently disclosed subject matter are described for illustrative purposes in relation to whole slide images, it should be readily understood that such features are not intended to be limiting. The claims that follow this disclosure are intended to define the scope of the protected subject matter.
The accompanying drawings, which are incorporated in and constitute a part of this specification, show certain aspects of the subject matter disclosed herein and, together with the description, help explain some of the principles associated with the disclosed implementations. In the drawings,
When practical, like labels are used to refer to same or similar items in the drawings.
In digital pathology, whole slide images of tissue slides can be used for computational processes.
The whole slide images can be generated by capturing and storing images of a whole slide, such as the tissue slide 204. As noted above, sets of the whole slide images can be used for different computational purposes, such as in machine learning, artificial intelligence, or other systems, or computational processes. However, the sets of whole slide images may contain duplicates, replicates, or multiples, reducing the accuracy and efficiency of the computational processes.
Consistent with implementations of the current subject matter, duplicate images are images of the same tissue slides that are from the same block of a tissue specimen of a patient. Duplicate images may include identical copies or images that are close to identical (e.g., captured from a different focal point, modality, etc.). Consistent with implementations of the current subject matter, replicate images are images of tissue slides including tissue from the same block, but of different slices of the tissue. Replicate images may include serial sections or slices related to one another that are similar in appearance, but are not the same. Replicate images may also include slices that are intact or disrupted (e.g., cut) contiguous samples, that are missing tissue, that have floating tissue, or the like. Consistent with implementations of the current subject matter, multiples are images of tissue slides that contain tissue samples from the same patient, but from different blocks. Images that are duplicates, replicates, and/or multiples can be related to one another such that they are not considered to represent independent data points during the computational processes. Inclusion of duplicate images, replicate images, and/or multiples in a set of whole slide images for use with computational processes can be detrimental to the performance of those computational processes.
To improve the performance of the computational processes, duplicate images, replicate images, and/or multiples may be removed from the set of whole slide images. However, removing the copies can be cumbersome, inefficient, and take a long time. Further complicating the removal of the copies, it can be difficult to identify which images from the set of images are copies. For example, many images may be associated with mismatched information, have incorrect labels, and/or have missing information. Many images may also be similar to one another, but not identical. For example,
The deduplication system consistent with implementations of the current subject matter may identify duplicates, replicates, and/or multiples in sets of whole slide images for removal from the image data sets using an automated tool and/or image processing techniques that are computationally cheaper, quicker, and more efficient than conventional techniques. For example, the deduplication system includes a deduplication engine that compares information, such as data related to the images stored in a manifest and/or data embedded in the images, and based on the information, identifies the duplicates, replicates, and/or multiples. The deduplication engine may also generate a metric based on a comparison of remaining images not previously identified as a duplicate, replicate, and/or multiple, and based on the metric, predict whether the remaining images are duplicates, replicates, and/or multiples. Based on the identification and/or the prediction, the duplicate, replicate, and/or multiple images may be removed from the set of the whole slide images, such as for use in the computational processes.
In some example embodiments, the deduplication engine 110 may be configured to identify, within an image set 115, one or more of a duplicate image, a replicate image, and a multiple image. The image set 115 may include a plurality of images depicting one or more slices of tissue from at least one block of a tissue sample. As used herein, the term “duplicate image” may refer to an image depicting a same slice of tissue as another image included in the image set 115. The term “replicate image” may refer to an image depicting a different slice of tissue from a same block of the tissue sample as another image included in the image set 115. The term “multiple image” may refer to an image depicting a slice of tissue from a different block of the tissue sample than another image from the image set 115. In the example shown in
The database 120 may store identification information 125 associated with the image set 115. The identification information 125 may include a manifest 510, an example of which is shown in
Referring again to
For example, the deduplication engine 110 may use autology techniques to standardize the identification information 125 associated with the image set 115. Autology techniques may include tools for curating large metadata sets, such as the identification information 125 associated with the image set 115, while arbitrating discrepancies within the identification information 125. Using the autology techniques described herein, the deduplication engine 110 may accelerate and/or improve curation of the identification information 125. For example, two images of a same tissue sample originating from a same organ may refer to the organ by different names. Accordingly, some images of the same tissue samples may include identifiers having different naming conventions. The deduplication engine 110 may standardize the identification information 125, such as the identifiers contained therein, to enable a subsequent comparison of the identification information 125.
In some implementations, the autology techniques used by the deduplication engine 110 may include using a Levenshtein distance to find similar terms that correspond to a same tissue sample, such as a same tissue sample identifier. The Levenshtein distance is a metric that measures the distance between two sequences, such as two strings, as the minimum number of single-character insertions, deletions, and substitutions required to change one sequence into the other. For example, similarity measures based on the Levenshtein distance may include a Levenshtein ratio, partial ratio, token sort ratio, token set ratio, or the like. The Levenshtein ratio, partial ratio, token set ratio, and/or token sort ratio indicates a similarity between the two sequences, such as two identifiers in the identification information 125 associated with the image set 115. The deduplication engine 110 may standardize the identification information 125 associated with the image set 115 by, for example, adjusting two identifiers to match when the two identifiers have a Levenshtein distance that meets a threshold. Standardizing the identification information 125 associated with the image set 115 can help improve performance of the deduplication engine 110 in identifying replicates, duplicates, and/or multiples within the image set 115.
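As a non-limiting illustration, the threshold-based standardization of identifiers may be sketched in Python as follows. The function names, the normalized similarity (one minus the edit distance divided by the longer length), the case and whitespace normalization, and the threshold value of 0.8 are assumptions made for this sketch and are not taken from the present disclosure.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical normalized similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def standardize(identifiers, threshold=0.8):
    """Map each identifier to the first earlier identifier it resembles.

    The threshold is an assumed value chosen only for illustration.
    """
    canonical = []
    mapping = {}
    for identifier in identifiers:
        key = identifier.strip().lower()
        match = next((c for c in canonical
                      if similarity(key, c.strip().lower()) >= threshold), None)
        if match is None:
            canonical.append(identifier)
            match = identifier
        mapping[identifier] = match
    return mapping
```

In this sketch, the identifiers "Kidney" and "Kidny" differ by one edit and would be mapped to a single standardized identifier, whereas "Kidney" and "Colon" would remain distinct.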
Referring again to
In some cases, where the metadata 520 of the image set 115 includes the barcode associated with each corresponding tissue sample, the barcode may encode at least a portion of the information associated with each individual image and/or the tissue sample depicted therein. Alternatively and/or additionally, at least a portion of the identification information 125 associated with the image and/or the tissue sample depicted in the image may be retrieved, for example, from the database 120, based on a corresponding resource identifier (e.g., a uniform resource locator (URL) of the database 120).
To compare the identification information 125, the deduplication engine 110 may be implemented as a tool, such as an automated tool. The tool may be run as a command line tool running on Python 3, among other systems. As an example, the client device 130 may receive an input including a selection of a sample identifier, a patient identifier, or the like. The input may additionally or alternatively include a file format of the image set 115, a name of the sample identifier 514 in the manifest 510, a name of the block identifier 512 in the manifest 510, an input directory or directories including the image set 115 and corresponding manifest 510, an output directory for saving the results, or the like. The tool may read image files and/or manifests, such as the manifest 510.
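As a non-limiting illustration, the inputs described above may be collected by a Python 3 command line parser such as the following. The option names and default values are hypothetical and are chosen only for this sketch; the present disclosure does not specify the options of the tool.

```python
import argparse


def build_parser():
    """Build a parser for the inputs described above (hypothetical options)."""
    parser = argparse.ArgumentParser(
        description="Deduplicate a set of whole slide images.")
    # One or more directories holding the image set and its manifest.
    parser.add_argument("--input-dir", nargs="+", required=True)
    # Directory in which the results are saved.
    parser.add_argument("--output-dir", required=True)
    # File format of the image set; "svs" is only a placeholder default.
    parser.add_argument("--file-format", default="svs")
    # Names of the sample identifier and block identifier in the manifest.
    parser.add_argument("--sample-id-column", default="sample_id")
    parser.add_argument("--block-id-column", default="block_id")
    return parser
```

A caller may then invoke `build_parser().parse_args()` and pass the resulting values to the portion of the tool that reads the image files and the manifest.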
In response to the first image being identified as a multiple, replicate, and/or duplicate of the second image based on the comparison of the identification information 125 associated with each of the first image and the second image, the deduplication engine 110 may update the identification information 125 to indicate the first image as a multiple, replicate, and/or duplicate (e.g., by including one or more corresponding flags in the manifest 510). In cases where the first image is determined to be a duplicate of the second image (e.g., depicting a same slice of tissue), the first image may be removed from the image set 115. It should be appreciated that the deduplication engine 110 may continue to deduplicate, based on the identification information 125, the remaining images in the image set 115 and make corresponding updates to the identification information 125. Moreover, in some example embodiments, the deduplication engine 110 may also deduplicate the image set 115 based on a metric indicative of a minimal difference achieved between each pair of images in the image set 115 while adjusting an alignment therebetween. In some cases, the metric-based deduplication may be performed in order to identify duplicate images, replicate images, and/or multiple images that the deduplication engine 110 failed to detect based on the identification information 125. Alternatively and/or additionally, the deduplication engine 110 may perform metric-based deduplication to correct any errors and/or discrepancies that may be present within the identification information 125 such as a mislabeling of duplicate images depicting the same slice of tissue, replicate images depicting adjacent slices of tissue from the same block of a tissue sample, and/or multiple images depicting slices of tissue from adjacent blocks of the tissue sample.
In some example embodiments, the deduplication engine 110 may generate the metric based on a comparison of the vector fields present in one or more pairs of images in the image set 115. Doing so may prevent the deduplication and ordering of the images from being compromised due to misalignments between the images. Accordingly, the comparison of the vector fields present in each pair of images may be performed while aligning the images to achieve a minimal difference in the vector fields therebetween. Alignment in this case may include identifying the center of each image and aligning the images based on the center of the images. Once the images are centered, alignment may further include reorienting the images to correct for any inadvertent rotations (e.g., clockwise or counterclockwise) and inversions of the images, which may occur during the scanning of the images.
Where the difference between the vector fields of two aligned images satisfies a first threshold (e.g., a first value x that is near zero), the deduplication engine 110 may determine that the two images are duplicates of one another. Alternatively, where the difference between the vector fields of the two images fails to satisfy the first threshold but satisfies a second threshold that is greater than the first threshold (e.g., a second value y that is greater than the first value x), the deduplication engine 110 may determine that the two images are replicates and/or multiples. In some cases, where the difference between the vector fields of the two images satisfies the second threshold, the deduplication engine 110 may determine that the two images are images of adjacent slices of tissue from the same block of the tissue sample or images from two adjacent blocks of the tissue sample.
As used herein, the vector field of an image may include one or more vectors, each of which corresponds to a section of the image and has a direction and a magnitude (e.g., length) corresponding to a change in the intensity values of the pixels (e.g., from black to white) included in that section of the image. To further illustrate,
In some example embodiments, the deduplication engine 110 may convert the scale of pixel intensity from the first scale [0,1] to the second scale [−1,1]. In cases where the first image 800 and the second image 850 are color images, the deduplication engine 110 may choose the first scale [0,1] or the second scale [−1,1] for each of the red, green, and/or blue components of the images. Nevertheless, it should be appreciated that either the first scale [0,1] or the second scale [−1,1] may be used for determining the vector fields of each of the first image 800 and the second image 850.
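As a non-limiting illustration, the conversion from the first scale [0,1] to the second scale [−1,1] may be sketched in Python as the linear map q' = 2q − 1. The function names and the use of a simple channel average for the grayscale conversion are assumptions made for this sketch; the same linear map could instead be applied to each color component separately.

```python
def rescale_intensity(q):
    """Map an intensity value from the scale [0, 1] to the scale [-1, 1]."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return 2.0 * q - 1.0


def to_grayscale(rgb):
    """Average the red, green, and blue components (an assumed weighting)."""
    red, green, blue = rgb
    return (red + green + blue) / 3.0


def prepare_image(rgb_image):
    """Convert rows of (r, g, b) pixels in [0, 1] to grayscale in [-1, 1]."""
    return [[rescale_intensity(to_grayscale(pixel)) for pixel in row]
            for row in rgb_image]
```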
Although either the first scale [0,1] or the second scale [−1,1] may be used, converting the scale of pixel intensity from the first scale [0,1] to the second scale [−1,1] may change the directionality of the resulting vector fields. To further illustrate,
In some example embodiments, the pixel gradient for the quadrant 825 may be calculated separately for horizontal components and vertical components of the vectors associated with each constituent pixel in the quadrant 825. For example, in some cases, the pixel gradient Ex along the x-axis of the quadrant 825 may be determined by applying Equation (1) below whereas the pixel gradient Ey along the y-axis of the quadrant 825 may be determined by applying Equation (2).
wherein q denotes the intensity value of each pixel, dx denotes a change in distance along the x axis, and dy denotes a change in distance along the y axis. It should be appreciated that the values of dx and dy may be positive or negative depending on where the pixel is located relative to the center of the quadrant 825.
The magnitude of the vector at the center of each portion of the first image 800, such as the quadrant 825, may be determined by applying Equation (3) below to the individual x and y components.
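As a non-limiting illustration, the computation of one vector per section may be sketched in Python as follows. The sketch assumes that the x and y components are intensity-weighted sums of the signed pixel offsets dx and dy from the center of the section, and that the magnitude is the Euclidean norm of the two components. These forms are assumptions made for illustration and are not a restatement of Equations (1) through (3); the function names are likewise hypothetical.

```python
import math


def section_vector(section):
    """Return (ex, ey) for one section given as rows of intensity values q.

    Assumed form: ex = sum(q * dx) and ey = sum(q * dy), where dx and dy are
    signed offsets of each pixel from the center of the section.
    """
    rows, cols = len(section), len(section[0])
    center_y, center_x = (rows - 1) / 2.0, (cols - 1) / 2.0
    ex = ey = 0.0
    for r, row in enumerate(section):
        for c, q in enumerate(row):
            ex += q * (c - center_x)  # dx is negative left of the center
            ey += q * (r - center_y)  # dy is negative above the center
    return ex, ey


def magnitude(ex, ey):
    """Euclidean length of a vector (assumed form of the magnitude)."""
    return math.hypot(ex, ey)


def vector_field(image, section_size):
    """Divide the image into square sections and compute one vector each.

    Pixels that do not fill a whole section at the right or bottom edge
    are ignored in this sketch.
    """
    field = []
    for top in range(0, len(image) - section_size + 1, section_size):
        row_vectors = []
        for left in range(0, len(image[0]) - section_size + 1, section_size):
            section = [row[left:left + section_size]
                       for row in image[top:top + section_size]]
            row_vectors.append(section_vector(section))
        field.append(row_vectors)
    return field
```

The section size passed to `vector_field` corresponds to the adjustable size of each section described above, and may be varied while searching for the minimal difference between two vector fields.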
As noted, converting the intensity values of the pixels in the first image 800 from the first scale [0,1] to the second scale [−1,1] may change the directionality of the resulting vector fields, for example, from pointing in the direction of the lighter (or lower intensity) pixels to pointing in the direction of the darker (or higher intensity) pixels in the first image 800. This phenomenon is shown in
In some example embodiments, the deduplication engine 110 may determine, based on a first vector field of the first image 800 and a second vector field of the second image 850, whether the first image 800 and the second image 850 are multiples, replicates, and/or duplicates of one another. However, in some instances, a misalignment between the first image 800 and the second image 850 may prevent the deduplication engine 110 from identifying the first image 800 and the second image 850 as multiples, replicates, and/or duplicates of one another at least because any similarities in the respective vector fields would be obscured by a corresponding misalignment between the two vector fields. The misalignments between the first image 800 and the second image 850 may arise from translations, rotations, and/or inversions of one or both images during scanning. For example, the first image 800 and the second image 850 may be duplicate images but if the first image 800 is scanned upside down while the second image 850 is scanned right side up, the correspondence between the two images may be obscured by the inversion of the first image 800.
Accordingly, the deduplication engine 110 may determine the difference between the first vector field of the first image 800 and the second vector field of the second image 850 all while aligning the first image 800 and the second image 850 to minimize the difference between the first vector field and the second vector field. That is, the deduplication engine 110 may continue to adjust the alignment between the first image 800 and the second image 850 in order to minimize any possible misalignment between the two images when computing the difference between the corresponding vector fields. In doing so, the deduplication engine 110 is able to detect instances where the similarities between the first image 800 and the second image 850 are obscured by a misalignment between the first image 800 and the second image 850. In some cases, while the deduplication engine 110 adjusts the alignment between the first image 800 and the second image 850, the deduplication engine 110 may determine the difference between the first vector field of the first image 800 and the second vector field of the second image 850 by at least subtracting the x-components of the first vector field and the second vector field separately from the y-components of the two vector fields.
While determining the difference between the first vector field and the second vector field, the deduplication engine 110 may translate the two vector fields, for example, along the x-axis and/or the y-axis at grid-size intervals to determine an alignment of the first image 800 and the second image 850 where the difference between the two vector fields is minimized. Doing so may enable the deduplication engine 110 to detect instances where the similarities in the vector fields of the first image 800 and the second image 850 are obscured by a translational misalignment between the first image 800 and the second image 850. Alternatively and/or additionally, while determining the difference between the first vector field and the second vector field, the two vector fields may also be rotated and/or inverted to determine an alignment of the first image 800 and the second image 850 that minimizes the difference between the two vector fields. Rotating the two vector fields may enable the deduplication engine 110 to detect instances where the similarities in the vector fields of the first image 800 and the second image 850 are obscured by a rotational misalignment between the two images. Meanwhile, inverting one vector field relative to the other vector field may enable the deduplication engine 110 to detect instances where the similarities in the vector fields of the first image 800 and the second image 850 are obscured by an inversion of one of the images. To enable the rotation of a vector field about its center axis, the deduplication engine 110 may use polar coordinates such that the rotation of the vector fields is quantified in degree intervals. An example of a vector field mapped to polar coordinates is shown in
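As a non-limiting illustration, the search for an alignment that minimizes the difference between two vector fields may be sketched in Python as follows. The sketch is deliberately restricted to a 180-degree rotation, two mirror inversions, and translations of up to one grid cell, and it does not use the polar-coordinate rotation described above. The function names, the mean absolute difference used as the metric, and the minimum-overlap rule are assumptions made for this sketch.

```python
from itertools import product


def rot180(field):
    """Rotate a field by 180 degrees: reverse the grid, negate the vectors."""
    return [[(-ex, -ey) for ex, ey in reversed(row)]
            for row in reversed(field)]


def mirror_x(field):
    """Invert left to right: reverse the columns, negate the x-components."""
    return [[(-ex, ey) for ex, ey in reversed(row)] for row in field]


def mirror_y(field):
    """Invert top to bottom: reverse the rows, negate the y-components."""
    return [[(ex, -ey) for ex, ey in row] for row in reversed(field)]


ORIENTATIONS = {"identity": lambda field: field, "rot180": rot180,
                "mirror_x": mirror_x, "mirror_y": mirror_y}


def shifted_difference(a, b, shift_rows, shift_cols, min_overlap=1):
    """Mean of |dx| + |dy| over overlapping cells, or None if too few overlap.

    The x-components and y-components are subtracted separately.
    """
    total, count = 0.0, 0
    for r, row in enumerate(a):
        for c, (ax, ay) in enumerate(row):
            r2, c2 = r + shift_rows, c + shift_cols
            if 0 <= r2 < len(b) and 0 <= c2 < len(b[0]):
                bx, by = b[r2][c2]
                total += abs(ax - bx) + abs(ay - by)
                count += 1
    return total / count if count >= min_overlap else None


def minimal_difference(a, b, max_shift=1):
    """Return (metric, orientation name, shift) with the smallest metric."""
    min_overlap = (len(a) * len(a[0])) // 2 + 1  # require a majority overlap
    best = None
    for name, transform in ORIENTATIONS.items():
        candidate = transform(b)
        shifts = range(-max_shift, max_shift + 1)
        for shift_rows, shift_cols in product(shifts, repeat=2):
            diff = shifted_difference(a, candidate, shift_rows, shift_cols,
                                      min_overlap)
            if diff is not None and (best is None or diff < best[0]):
                best = (diff, name, (shift_rows, shift_cols))
    return best
```

In this sketch, a field compared against a copy of itself that has been rotated by 180 degrees yields a metric of zero once the rotation is undone, even though the direct, unaligned comparison yields a nonzero difference.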
In some example embodiments, the deduplication engine 110 may determine, based at least on the minimal difference between the first vector field of the first image 800 and the second vector field of the second image 850, whether the first image 800 and the second image 850 are multiples, replicates, and/or duplicates of one another. For example, where the difference between the first vector field of the first image 800 and the second vector field of the second image 850 satisfies a first threshold (e.g., a first value x that is near zero), the deduplication engine 110 may determine that the first image 800 and the second image 850 are duplicates of one another. Alternatively, where the difference between the first vector field of the first image 800 and the second vector field of the second image 850 fails to satisfy the first threshold but satisfies a second threshold that is greater than the first threshold (e.g., a second value y that is greater than the first value x), the deduplication engine 110 may determine that the first image 800 and the second image 850 are replicates and/or multiples. In some cases, the first image 800 and the second image 850 may be identified as images of adjacent slices from the same block of the tissue sample where the difference between their respective vector fields satisfies the second threshold. Identifying the first image 800 and the second image 850 as adjacent images may enable the deduplication engine 110 to restore the original sequence of the images from the block of the tissue sample.
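As a non-limiting sketch of the two-threshold determination described above, the following example maps a minimal difference to a label. The function name `classify_by_metric`, the default threshold values, and the interpretation of "satisfying" a threshold as being less than or equal to it are assumptions made for illustration only; suitable values for x and y would depend on the images and on how the difference is computed.

```python
def classify_by_metric(
    min_difference: float,
    first_threshold: float = 0.05,   # hypothetical value x, near zero
    second_threshold: float = 0.30,  # hypothetical value y, greater than x
) -> str:
    """Label an image pair from the minimal vector-field difference."""
    if not 0 <= first_threshold < second_threshold:
        raise ValueError("expected 0 <= first_threshold < second_threshold")
    if min_difference <= first_threshold:
        return "duplicate"
    if min_difference <= second_threshold:
        # May also indicate adjacent slices from the same block.
        return "replicate_or_multiple"
    return "distinct"


print(classify_by_metric(0.01))
print(classify_by_metric(0.20))
print(classify_by_metric(0.90))
```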
At 1002, the deduplication engine 110 may identify, based on the identification information 125 associated with the image set 115, one or more of a duplicate image, a replicate image, and a multiple image present in the image set 115. In some example embodiments, the deduplication engine 110 may deduplicate the image set 115 based on the identification information 125 associated with the image set 115. The identification information 125 associated with the image set 115 may include the manifest 510 and the metadata 520, which may be stored as a part of the manifest 510 and/or embedded with the individual images in the image set 115. Moreover, the identification information 125 associated with the image set 115 may include a variety of data including, for example, a sample identifier, a patient identifier, a block identifier, a slide identifier, an imaging modality, a scanning parameter, and/or the like.
Accordingly, the deduplication engine 110 may identify a first image in the image set 115 as a duplicate of a second image in the image set 115 if the first image and the second image are associated with matching sample identifiers, patient identifiers, block identifiers, and slide identifiers. Where the first image and the second image are associated with matching sample identifiers, patient identifiers, and block identifiers but different slide identifiers, the deduplication engine 110 may identify the first image as a replicate of the second image. Furthermore, where the first image and the second image are associated with matching sample identifiers and patient identifiers but different block identifiers and different slide identifiers, the deduplication engine 110 may identify the first image and the second image as multiples of one another.
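The identifier-matching rules described above may be sketched, purely for illustration, as follows. The `ImageIds` record, its field names, and the function `classify_by_identifiers` are hypothetical; the sketch returns `None` for combinations that the rules above do not address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageIds:
    """Hypothetical record of the identifiers associated with one image."""
    sample_id: str
    patient_id: str
    block_id: str
    slide_id: str


def classify_by_identifiers(a: ImageIds, b: ImageIds) -> Optional[str]:
    """Apply the duplicate / replicate / multiple rules to a pair of images."""
    if (a.sample_id, a.patient_id) != (b.sample_id, b.patient_id):
        return None  # different sample or patient: no relationship inferred
    same_block = a.block_id == b.block_id
    same_slide = a.slide_id == b.slide_id
    if same_block and same_slide:
        return "duplicate"
    if same_block:
        return "replicate"
    if not same_slide:
        return "multiple"
    return None  # different block but same slide identifier: not covered above


first = ImageIds("S1", "P1", "B1", "SL1")
print(classify_by_identifiers(first, ImageIds("S1", "P1", "B1", "SL1")))
print(classify_by_identifiers(first, ImageIds("S1", "P1", "B1", "SL2")))
print(classify_by_identifiers(first, ImageIds("S1", "P1", "B2", "SL3")))
```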
At 1004, the deduplication engine 110 may identify, based at least on a metric indicative of a minimal difference achieved between each pair of images in the image set 115 while adjusting an alignment therebetween, the one or more of a duplicate image, a replicate image, and a multiple image present in the image set 115. In some example embodiments, the deduplication engine 110 may deduplicate the image set 115 based on a metric indicative of a minimal difference achieved between each pair of images in the image set 115, such as a minimal difference in the respective vector fields of the two images, while adjusting the alignment therebetween. For example, the deduplication engine 110 may determine the difference between the first vector field of the first image 800 and the second vector field of the second image 850 while adjusting the alignment between the first image 800 and the second image 850 through translation, rotation, inversion, and/or the like. The minimal difference that is achieved between the two vector fields while adjusting the alignment between the first image 800 and the second image 850 may be indicative of the similarities between the two images or lack thereof. For instance, where the difference between the first vector field of the first image 800 and the second vector field of the second image 850 satisfies the first threshold (e.g., a first value x that is near zero), the deduplication engine 110 may determine that the first image 800 and the second image 850 are duplicates of one another. Alternatively and/or additionally, where the difference between their respective vector fields does not satisfy the first threshold but satisfies a second threshold that is greater than the first threshold, the first image 800 and the second image 850 may be identified as replicates, and more specifically, adjacent images from the same block of the tissue sample.
In some cases, the metric-based deduplication may be performed on the remaining images in the image set 115 subsequent to the deduplication performed based on the identification information 125. That is, once the deduplication engine 110 has identified one or more duplicate images, replicate images, and/or multiple images in the image set 115 based on the identification information 125, the deduplication engine 110 may perform metric-based deduplication on the remainder of the image set 115. Alternatively, after the deduplication engine 110 has identified one or more duplicate images, replicate images, and/or multiple images in the image set 115 based on the identification information 125, the deduplication engine 110 may perform metric-based deduplication on the entirety of the image set 115. Doing so may enable the deduplication engine 110 to detect any errors that may be present in the identification information 125.
At 1006, the deduplication engine 110 may update the identification information 125 associated with the image set 115 to indicate the one or more of the duplicate image, the replicate image, and the multiple image identified as present within the image set 115. For instance, in some example embodiments, the deduplication engine 110 may update the identification information 125 (e.g., the manifest 510, the metadata 520, and/or the like) by at least setting one or more flags to indicate whether each image in the image set 115 is a duplicate, replicate, and/or multiple. In some cases, the deduplication engine 110 may also update the identification information 125 to indicate the sequence of the images within the image set 115. For example, the identification information 125 may be updated to identify replicate images depicting adjacent slices of tissue from the same block of the tissue sample and multiple images depicting slices of tissue from adjacent blocks of the tissue sample. Alternatively and/or additionally, the deduplication engine 110 may update the identification information 125 to correct for any discrepancies between the identification information 125 and the one or more duplicate images, replicate images, and multiple images identified by the deduplication engine 110 through metric-based deduplication.
At 1052, the deduplication engine 110 may receive the image set 115. For example, the deduplication engine 110 may receive, from the database 120, the image set 115, which may be a set of whole slide images depicting slices of tissue from one or more blocks of a tissue sample of a patient. In some cases, the image set 115 may include duplicate images, which are identical or near-identical whole slide images depicting the same tissue slice. Alternatively and/or additionally, the image set 115 may include replicate images, which are whole slide images depicting adjacent tissue slices from the same block of the tissue sample. In some instances, the image set 115 may also include multiple images, which are whole slide images depicting tissue slices from different blocks of the tissue specimen. In the case of replicates and multiples, the image set 115 may include whole slide images depicting adjacent slices of the tissue sample as well as whole slide images depicting slices of the tissue sample from adjacent blocks of the tissue sample. Replicate images and/or multiple images in the image set 115 may, in some cases, be out of their original sequence within individual blocks and across adjacent blocks.
At 1054, the deduplication engine 110 may standardize the identification information 125 associated with the image set 115. In some example embodiments, the deduplication engine 110 may standardize the identification information 125 associated with the image set 115 by at least reconciling the different formats and naming conventions that may be present therein. To do so, the deduplication engine 110 may apply one or more techniques, including autology techniques such as Levenshtein distance, to standardize the identification information 125 associated with the image set 115. The standardization of the identification information 125 may enable the deduplication engine 110 to perform a subsequent analysis of the identification information 125.
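One way in which a Levenshtein distance might be used to reconcile differing naming conventions is sketched below, as a non-limiting example. The helper names `levenshtein` and `standardize`, the list of canonical values, and the `max_distance` cutoff are assumptions introduced for illustration; the present disclosure does not specify how the distance is applied.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def standardize(value: str, canonical: List[str], max_distance: int = 2) -> str:
    """Replace a value with the closest canonical spelling, if close enough."""
    cleaned = value.strip().lower()
    best = min(canonical, key=lambda c: levenshtein(cleaned, c.lower()))
    if levenshtein(cleaned, best.lower()) <= max_distance:
        return best
    return value  # too far from every canonical value: leave unchanged


modalities = ["brightfield", "fluorescence"]  # hypothetical canonical values
print(standardize(" Brightfeild ", modalities))
print(standardize("H&E", modalities))
```

Values that are not within the cutoff of any canonical spelling are returned unchanged in this sketch, so that an unrecognized entry is not silently rewritten.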
At 1056, the deduplication engine 110 may analyze the manifest 510 associated with the image set 115. At 1057, the deduplication engine 110 may identify, based on the analysis of the manifest 510, one or more duplicate, replicate, and/or multiple images present in the image set 115. For example, the manifest 510 associated with the image set 115 may include one or more identifiers, such as one or more alpha, numeric, alphanumeric, or the like, identifiers. Furthermore, the manifest 510 may include, for each image included in the image set 115, a slide identifier, a tissue type or tissue identifier, a specimen or sample identifier, a block identifier, a patient identifier, and/or the like. Accordingly, the deduplication engine 110 may analyze the manifest 510 associated with the image set 115 to detect, for example, identical and/or sequential identifiers indicative of one or more duplicate, replicate, and/or multiple images present within the image set 115.
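As a non-limiting illustration of detecting sequential identifiers, the following sketch treats two identifiers as sequential when they share a prefix and their trailing numbers differ by one. The identifier format shown (e.g., "BLK2-S007") and the function names are hypothetical; actual manifests may follow different conventions.

```python
import re
from typing import Optional, Tuple


def split_identifier(identifier: str) -> Optional[Tuple[str, int]]:
    """Split an identifier into (prefix, trailing number), or None if no number."""
    match = re.fullmatch(r"(.*?)(\d+)", identifier)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def are_sequential(id_a: str, id_b: str) -> bool:
    """True when the prefixes match and the trailing numbers differ by one."""
    parts_a, parts_b = split_identifier(id_a), split_identifier(id_b)
    if parts_a is None or parts_b is None:
        return False
    return parts_a[0] == parts_b[0] and abs(parts_a[1] - parts_b[1]) == 1


print(are_sequential("BLK2-S007", "BLK2-S008"))
print(are_sequential("BLK2-S007", "BLK3-S008"))
```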
At 1058, the deduplication engine 110 may analyze the metadata 520 associated with the image set 115. At 1059, the deduplication engine 110 may identify, based on the analysis of the metadata 520, one or more duplicate, replicate, and/or multiple images present in the image set 115. The metadata 520 associated with the image set 115 may be stored as a part of the manifest 510 and/or embedded within the individual images in the image set 115. In some cases, the metadata 520 may include image timestamps, image modalities, slide identifiers, barcodes, scanning parameters, and/or the like. Accordingly, in some example embodiments, the deduplication engine 110 may also analyze the metadata 520 associated with the image set 115 to detect one or more duplicate, replicate, and/or multiple images present within the image set 115.
At 1060, the deduplication engine 110 may determine a metric associated with the image set 115. At 1061, the deduplication engine 110 may predict, based on the metric, one or more duplicate, replicate, and/or multiple images present in the image set 115. In some example embodiments, the deduplication engine 110 may determine a metric corresponding to a difference in the vector fields between one or more pairs of images in the image set 115. In some cases, the deduplication engine 110 may align each pair of images in the image set 115 such that the difference in the respective vector fields of the images is minimized. The presence of duplicates, replicates, and/or multiples may be determined based on the minimal difference in the vector fields achieved by adjusting the alignment between each pair of images. For example, where the difference between the vector fields of two aligned images satisfies a first threshold (e.g., a first value x that is near zero), the deduplication engine 110 may determine that the two images are duplicates of one another. Alternatively, where the difference between the vector fields of the two images fails to satisfy the first threshold but satisfies a second threshold that is greater than the first threshold (e.g., a second value y that is greater than the first value x), the deduplication engine 110 may determine that the two images are replicates and/or multiples. In some cases, where the difference between the vector fields of the two images satisfies the second threshold but not the first threshold, the deduplication engine 110 may determine that the two images depict adjacent tissue slices from the same block of the tissue sample.
At 1062, the deduplication engine 110 may confirm the one or more images identified as a replicate, duplicate, and/or a multiple. For example, in some example embodiments, the deduplication engine 110 may confirm the one or more images identified as a duplicate before flagging and/or removing the one or more images from the image set 115. In some cases, the deduplication engine 110 may also confirm, based on one or more user inputs, the one or more images identified as depicting adjacent slices from the same block of the tissue sample before restoring the original sequence of the images from the block of the tissue sample. For instance, the one or more user inputs may indicate whether the one or more images are correctly identified as depicting adjacent slices from the same block of the tissue sample, in which case the deduplication engine 110 may adjust, based at least on the ordering of the one or more images identified as depicting adjacent slices from the same block of the tissue sample, the current sequence of images to restore the original sequence of the images. Where the one or more user inputs reject the one or more images identified as depicting adjacent slices from the same block of the tissue sample, the deduplication engine 110 may maintain the current sequence of images.
At 1102, the deduplication engine 110 may compare image information from the manifest 510 associated with a set of input images. At 1104, the deduplication engine 110 may identify, based on the comparison of image information from the manifest 510, one or more duplicates, replicates, and/or multiples present in the image set 115. As noted, the manifest 510 for the image set 115 may include one or more identifiers including, for example, alpha, numeric, and/or alphanumeric identifiers. Examples of such identifiers may include, for each image included in the image set 115, a slide identifier, a tissue type or tissue identifier, a specimen or sample identifier, a block identifier, a patient identifier, and/or the like. Accordingly, the deduplication engine 110 may analyze the manifest 510 of the image set 115 to identify one or more duplicates, replicates, and/or multiples.
At 1106, the deduplication engine 110 may update the manifest 510 to indicate the one or more duplicates, replicates, and/or multiples identified as present in the image set 115. In some example embodiments, the manifest 510 of the image set 115 may include one or more columns occupied by one or more flags identifying the corresponding images as a replicate, a duplicate, and/or a multiple. Accordingly, the deduplication engine 110 may, based on whether an image is identified as a duplicate, a replicate, and/or a multiple, set one or more corresponding flags in the manifest 510 of the image set 115.
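By way of example only, the setting of flag columns in a manifest may be sketched as follows. The manifest is modeled here as a list of dictionaries, and the column names `image_id`, `is_duplicate`, `is_replicate`, and `is_multiple` are hypothetical; the present disclosure does not specify the layout of the manifest 510.

```python
from typing import Dict, List, Set

KINDS = ("duplicate", "replicate", "multiple")


def flag_manifest(
    rows: List[Dict[str, object]],
    labels: Dict[str, Set[str]],
) -> List[Dict[str, object]]:
    """Return a copy of the manifest rows with one boolean flag column per kind."""
    updated = []
    for row in rows:
        found = labels.get(str(row["image_id"]), set())
        flagged = dict(row)  # copy so the input manifest is left untouched
        for kind in KINDS:
            flagged[f"is_{kind}"] = kind in found
        updated.append(flagged)
    return updated


manifest = [{"image_id": "img-1"}, {"image_id": "img-2"}, {"image_id": "img-3"}]
labels = {"img-2": {"duplicate"}, "img-3": {"replicate", "multiple"}}
for row in flag_manifest(manifest, labels):
    print(row)
```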
At 1202, the deduplication engine 110 may extract the metadata 520 associated with the image set 115. At 1204, the deduplication engine 110 may compare the extracted metadata 520. In some example embodiments, the metadata 520 associated with the image set 115 may be stored as a part of the manifest 510 and/or embedded with the individual images in the image set 115. As noted, in some cases, the metadata 520 may include image timestamps, image modalities, slide identifiers, barcodes, scanning parameters, and/or the like. Accordingly, the deduplication engine 110 may compare the metadata 520 associated with a first image and a second image by at least comparing, for example, a first timestamp at which the first image was acquired, a second timestamp at which the second image was acquired, a first modality used to acquire the first image, a second modality used to acquire the second image, a first slide identifier associated with the first tissue sample, a second slide identifier associated with the second tissue sample, a first barcode associated with the first tissue sample, a second barcode associated with the second tissue sample, a first scanning parameter associated with the first image, and a second scanning parameter associated with the second image.
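The field-by-field comparison of metadata described above may be sketched, in a non-limiting manner, as follows. The field names in `METADATA_FIELDS` and the treatment of a missing field as a non-match are assumptions made for illustration.

```python
from typing import Dict

# Hypothetical field names corresponding to the kinds of metadata listed above.
METADATA_FIELDS = ("timestamp", "modality", "slide_id", "barcode", "scanning_parameter")


def compare_metadata(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, bool]:
    """Report, per field, whether both images carry the field with equal values."""
    return {
        field: field in first and field in second and first[field] == second[field]
        for field in METADATA_FIELDS
    }


image_a = {"timestamp": "2022-09-09T10:00", "modality": "brightfield", "barcode": "X1"}
image_b = {"timestamp": "2022-09-09T10:05", "modality": "brightfield", "barcode": "X1"}
print(compare_metadata(image_a, image_b))
```

The per-field result may then be combined with the identifier-based analysis, for example by treating matching barcodes with differing timestamps as a possible rescan of the same slide, although any such rule would be an implementation choice.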
At 1206, the deduplication engine 110 may identify, based at least on the comparison of the extracted metadata 520, one or more duplicates, replicates, and/or multiples. In some example embodiments, one or more duplicates, replicates, and/or multiples present within the image set 115 may be identified based on the comparison of the corresponding metadata 520. Moreover, in some cases, the deduplication engine 110 may update the manifest 510 of the image set 115, for example, by setting one or more corresponding flags, to indicate which images are duplicates, replicates, and/or multiples.
At 1302, the deduplication engine 110 may preprocess a first image and a second image. In some example embodiments, the deduplication engine 110 may preprocess the first image and the second image by converting the images into grayscale images. Furthermore, the deduplication engine 110 may preprocess the first image and the second image by dividing each image into equal sized sections (e.g., a grid and/or the like). In some cases, the preprocessing of the first image and the second image may further include converting the scale of pixel intensity values from a first scale [0,1] to a second scale [−1,1].
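A minimal sketch of the preprocessing described above is given below, assuming an image represented as nested lists of RGB pixels with channel values in [0, 1]. The luma weights used for the grayscale conversion (0.299, 0.587, 0.114) are a common convention adopted here as an assumption, and the function names and the requirement that the image dimensions be multiples of the section size are likewise illustrative.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # RGB channels, each assumed to lie in [0, 1]
Gray = List[List[float]]


def to_grayscale(rgb: List[List[Pixel]]) -> Gray:
    """Weighted sum of the color channels (weights are an assumption)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]


def rescale(gray: Gray) -> Gray:
    """Map intensities from the scale [0, 1] to the scale [-1, 1]."""
    return [[2.0 * value - 1.0 for value in row] for row in gray]


def split_into_sections(gray: Gray, size: int) -> List[List[Gray]]:
    """Divide the image into a grid of equal size-by-size sections."""
    rows, cols = len(gray), len(gray[0])
    if rows % size or cols % size:
        raise ValueError("image dimensions must be multiples of the section size")
    return [
        [[line[left:left + size] for line in gray[top:top + size]]
         for left in range(0, cols, size)]
        for top in range(0, rows, size)
    ]


def preprocess(rgb: List[List[Pixel]], size: int) -> List[List[Gray]]:
    return split_into_sections(rescale(to_grayscale(rgb)), size)


# 4x4 test image: left half black, right half white.
image = [[(0.0, 0.0, 0.0)] * 2 + [(1.0, 1.0, 1.0)] * 2 for _ in range(4)]
sections = preprocess(image, 2)
print(len(sections), len(sections[0]))
```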
At 1304, the deduplication engine 110 may generate a first vector field for the first image and a second vector field for the second image. In some example embodiments, the vector field for an image may include, for each section within the image, a vector whose direction and magnitude (e.g., length) correspond to the change in the intensity values of the pixels (e.g., from black to white) included in the section of the image.
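For illustration only, one way to derive a vector for each section is to average the horizontal and vertical intensity changes between neighboring pixels. The use of mean forward differences and the function names below are assumptions of this sketch; the present disclosure does not specify how the direction and magnitude of each vector are computed.

```python
from typing import List, Tuple

Gray = List[List[float]]


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def section_vector(section: Gray) -> Tuple[float, float]:
    """(x, y) vector: mean intensity change along columns and along rows."""
    rows, cols = len(section), len(section[0])
    dx = [section[r][c + 1] - section[r][c] for r in range(rows) for c in range(cols - 1)]
    dy = [section[r + 1][c] - section[r][c] for r in range(rows - 1) for c in range(cols)]
    return (_mean(dx), _mean(dy))


def vector_field(sections: List[List[Gray]]) -> List[List[Tuple[float, float]]]:
    """One vector per section of the grid."""
    return [[section_vector(section) for section in row] for row in sections]


dark_to_light = [[-1.0, 1.0], [-1.0, 1.0]]   # intensity rises left to right
top_to_bottom = [[0.0, 0.0], [1.0, 1.0]]     # intensity rises top to bottom
print(vector_field([[dark_to_light, top_to_bottom]]))
```

In this sketch, a section that brightens from left to right yields a vector pointing along the positive x-axis, and the length of the vector grows with the size of the intensity change.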
At 1306, the deduplication engine 110 may compare the first vector field of the first image and the second vector field of the second image. For example, in some cases, the deduplication engine 110 may compare the first vector field of the first image and the second vector field of the second image by at least subtracting the x-components of the first vector field and the second vector field separately from the y-components of the two vector fields. In some example embodiments, as a part of comparing the first vector field of the first image and the second vector field of the second image, the deduplication engine 110 may also align the first image and the second image to minimize the difference between the two vector fields. For example, while determining the difference between the first vector field and the second vector field, the deduplication engine 110 may translate the two vector fields, for example, along the x-axis and/or the y-axis at grid-size intervals to determine an alignment of the two images where the difference between the two vector fields is minimized. Alternatively and/or additionally, while determining the difference between the first vector field and the second vector field, the two vector fields may also be rotated and/or inverted to determine an alignment of the first image and the second image that minimizes the difference between the two vector fields.
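The translation of vector fields at grid-size intervals may be sketched, as a non-limiting example, by an exhaustive search over small shifts. The function names, the `max_shift` bound, and the choice to score only the cells where the two fields overlap are assumptions made for this sketch; a practical implementation would likely also require a minimum overlap so that a small overlapping region cannot produce a spuriously low difference.

```python
from typing import List, Tuple

Field = List[List[Tuple[float, float]]]


def overlap_difference(a: Field, b: Field, dy: int, dx: int) -> float:
    """Mean absolute component difference over cells where a and shifted b overlap."""
    total, count = 0.0, 0
    for r in range(len(a)):
        for c in range(len(a[0])):
            rb, cb = r - dy, c - dx  # cell of b that lands on (r, c) after the shift
            if 0 <= rb < len(b) and 0 <= cb < len(b[0]):
                (ax, ay), (bx, by) = a[r][c], b[rb][cb]
                total += abs(ax - bx) + abs(ay - by)
                count += 1
    return total / count if count else float("inf")


def best_translation(a: Field, b: Field, max_shift: int = 1) -> Tuple[float, Tuple[int, int]]:
    """Try every shift of up to max_shift grid cells; keep the smallest difference."""
    best = (float("inf"), (0, 0))
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            diff = overlap_difference(a, b, dy, dx)
            if diff < best[0]:
                best = (diff, (dy, dx))
    return best


# field_b is field_a moved one grid cell to the left, padded with a filler vector.
field_a = [[(float(3 * r + c), float(r - c)) for c in range(3)] for r in range(3)]
field_b = [[field_a[r][c + 1] if c < 2 else (99.0, 99.0) for c in range(3)] for r in range(3)]
print(best_translation(field_a, field_b))
```

Here the search recovers the one-cell shift along the x-axis, at which the overlapping portions of the two vector fields coincide.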
At 1308, the deduplication engine 110 may generate, based at least on the comparison of the first vector field and the second vector field, a metric. In some example embodiments, the deduplication engine 110 may generate a metric corresponding to the minimal difference in the first vector field and the second vector field that the deduplication engine 110 was able to achieve while adjusting the alignment between the first image and the second image.
At 1310, the deduplication engine 110 may identify, based at least on the metric, the first image and/or the second image as a replicate, a duplicate, and/or a multiple. In some example embodiments, the deduplication engine 110 may determine, based at least on the minimal difference between the first vector field of the first image and the second vector field of the second image, whether the first image and the second image are multiples, replicates, and/or duplicates of one another. For example, where the difference between the first vector field and the second vector field satisfies a first threshold (e.g., a first value x that is near zero), the deduplication engine 110 may determine that the two corresponding images are duplicates of one another. Alternatively, where the difference between the two vector fields fails to satisfy the first threshold but satisfies a second threshold that is greater than the first threshold (e.g., a second value y that is greater than the first value x), the deduplication engine 110 may determine that the first image and the second image are replicates and/or multiples.
In view of the above-described implementations of subject matter this application discloses the following list of examples, wherein one feature of an example in isolation or more than one feature of said example taken in combination and, optionally, in combination with one or more features of one or more further examples are further examples also falling within the disclosure of this application:
As shown in
The memory 1420 is a computer readable medium, such as volatile or non-volatile memory, that stores information within the computing system 1400. The memory 1420 can store data structures representing configuration object databases, for example. The storage device 1430 is capable of providing persistent storage for the computing system 1400. The storage device 1430 can be a floppy disk device, a hard disk device, an optical disk device, or a tape device, or other suitable persistent storage means. The input/output device 1440 provides input/output operations for the computing system 1400. In some implementations of the current subject matter, the input/output device 1440 includes a keyboard and/or pointing device. In various implementations, the input/output device 1440 includes a display unit for displaying graphical user interfaces.
According to some implementations of the current subject matter, the input/output device 1440 can provide input/output operations for a network device. For example, the input/output device 1440 can include Ethernet ports or other networking ports to communicate with one or more wired and/or wireless networks (e.g., a local area network (LAN), a wide area network (WAN), the Internet).
In some implementations of the current subject matter, the computing system 1400 can be used to execute various interactive computer software applications that can be used for organization, analysis and/or storage of data in various (e.g., tabular) formats (e.g., Microsoft Excel®, and/or any other type of software). Alternatively, the computing system 1400 can be used to execute any type of software applications. These applications can be used to perform various functionalities, e.g., planning functionalities (e.g., generating, managing, editing of spreadsheet documents, word processing documents, and/or any other objects, etc.), computing functionalities, communications functionalities, etc. The applications can include various add-in functionalities or can be standalone computing products and/or functionalities. Upon activation within the applications, the functionalities can be used to generate the user interface provided via the input/output device 1440. The user interface can be generated and presented to a user by the computing system 1400 (e.g., on a computer screen monitor, etc.).
One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs, field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable system or computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device, such as for example magnetic discs, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example, as would a processor cache or other random access memory associated with one or more physical processor cores.
To provide for interaction with a user, one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT) or a liquid crystal display (LCD) or a light emitting diode (LED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. Other possible input devices include touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive track pads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.
The subject matter described herein can be embodied in systems, apparatus, methods, and/or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and subcombinations of the disclosed features and/or combinations and subcombinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and/or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. For example, the logic flows may include different and/or additional operations than shown without departing from the scope of the present disclosure. One or more operations of the logic flows may be repeated and/or omitted without departing from the scope of the present disclosure. Other implementations may be within the scope of the following claims.
This application claims priority to U.S. Provisional Application No. 63/405,332, filed on Sep. 9, 2022 and entitled “DEDUPLICATION OF SAMPLE IMAGES,” the disclosure of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63405332 | Sep 2022 | US

Relation | Number | Date | Country
---|---|---|---
Parent | PCT/US2023/073742 | Sep 2023 | WO
Child | 19072454 | | US