The present invention relates to histology slide images, and in particular, to methods, systems, and computer program products for analyzing and normalizing color vectors and color intensity data of histology slide images.
In many biological fields, tissue samples are taken from a subject for analysis. One common way of analyzing the tissue sample is to treat it with stains that have selective affinities for different biological substances. The majority of stains only absorb light, and the stained slides are therefore viewed using a microscope with a light illuminating the sample from below. If no stain is present, all of the light will pass through, appearing bright white. Areas where the stain has adhered to a substance in the tissue will absorb some of the light. The amount of light absorbed depends on many factors. For a given unit of stain, a certain amount of light in each spectrum will be absorbed. In the case of multispectral imaging, this process can be quite complicated. For example, standard 24-bit red-green-blue (RGB) cameras can be used to obtain images in the three wavelengths (red-green-blue) of light. The proportion of each wavelength absorbed forms the stain vector. The stain vector not only varies greatly among different stains but can also vary significantly for the same stain depending on such factors as the manufacturer, the storage conditions prior to use, and the method of application.
The overall amount of light absorbed also varies between slides prepared differently. The two most prominent factors that affect the intensity of a slide are the relative amounts of stain added in the original treatment and the subsequent storage and handling of the slide, as stains can fade when exposed to light. The amount of light absorbed is referred to as the stain intensity.
The absolute color values of a slide have many influences, and generally only one of the influences is the biological component, i.e., the actual amount of the cellular substance to which a particular stain will attach. For example, in the most popular staining method for medical diagnosis, hematoxylin selectively stains nucleic acids a blue-purple hue while eosin stains proteins a bright pink color. Other variations result from staining compounds that do not absorb the exact same amounts of light, therefore exhibiting slightly different colors.
Most current applications for analyzing the images concentrate on shape features and are thus not affected by the color irregularities between various slides, except when those irregularities interfere with the segmentation on which the shape features are based. When color information is utilized, the raw color values obtained from the scanner can be used. This approach adds some information, but differences in staining are not taken into account.
Methods, systems and computer program products for normalizing histology slide images are provided. A color vector for pixels of the histology slide images is determined. An intensity profile of a stain for the pixels of the histology slide images is normalized. Normalized image data of the histology slide images is provided including the color vector and the normalized intensity profile of a stain for the pixels of the histology slide images.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain principles of the invention.
FIGS. 1 and 2A-2B are flowcharts illustrating operations according to some embodiments of the present invention.
The present invention now will be described hereinafter with reference to the accompanying drawings and examples, in which embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Like numbers refer to like elements throughout. In the figures, the thickness of certain lines, layers, components, elements or features may be exaggerated for clarity.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. As used herein, phrases such as “between X and Y” and “between about X and Y” should be interpreted to include X and Y. As used herein, phrases such as “between about X and Y” mean “between about X and about Y.” As used herein, phrases such as “from about X to Y” mean “from about X to about Y.”
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the specification and relevant art and should not be interpreted in an idealized or overly formal sense unless expressly so defined herein. Well-known functions or constructions may not be described in detail for brevity and/or clarity.
It will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. Thus, a “first” element discussed below could also be termed a “second” element without departing from the teachings of the present invention. The sequence of operations (or steps) is not limited to the order presented in the claims or figures unless specifically indicated otherwise.
The present invention is described below with reference to block diagrams and/or flowchart illustrations of methods, apparatus (systems) and/or computer program products according to embodiments of the invention. It is understood that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instructions which implement the function/act specified in the block diagrams and/or flowchart block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
Accordingly, the present invention may be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.). Furthermore, embodiments of the present invention may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain or store the program for use by or in connection with the instruction execution system, apparatus, or device.
As illustrated in
As shown in
Exemplary techniques for determining and/or correcting color vectors for pixels of the histology slide images are discussed in the Example section below.
As shown in
Exemplary techniques for normalizing and/or correcting intensity variations are discussed in the Example section below.
As illustrated in
As shown in
As will be appreciated by those of skill in the art, the operating system 152 shown in
While the present invention is illustrated, for example, with reference to the histology slide normalization module 150 and the feature analysis module 152 being an application program in
The I/O data port can be used to transfer information between the data processing system 100 and the histology slide imaging system 120 or another computer system or a network (e.g., the Internet) or to other devices controlled by the processor. These components may be conventional components such as those used in many conventional data processing systems that may be configured in accordance with the present invention to operate as described herein. Therefore, the histology slide normalization module 150 can be used to analyze histology slide imaging data 160 that has been previously collected and/or data 160 that is collected from the histology slide imaging system 120. The histology slide imaging system 120 can be a scanning system, e.g., the Aperio Scanscope® (Aperio Technologies, Inc., Vista, Calif.). The feature analysis module 152 can be used to analyze features in the histology slide images, for example, using the normalized image data provided by the histology slide normalization module 150 to detect pathologies in the histology slide images.
Although embodiments according to the present invention are described herein with respect to two stains for detecting melanoma, it should be understood that three or more stains can be used and/or other suitable types of histology slides can be used. Moreover, any suitable staining technique can be used. For example, the slides could be stained with any combination of two or three or more stains including hematoxylin, eosin, Periodic acid Schiff, and/or immunohistochemistry stains (such as MART-1).
Embodiments according to the present invention will now be described with respect to the following non-limiting examples.
The red-green-blue color values are converted to their corresponding optical density (OD) values:
$$\mathrm{OD} = -\log_{10}(I)$$
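The conversion can be sketched as follows. This is a minimal illustration, not the implementation of the embodiments: it assumes that I is the 8-bit channel value divided by a white background level of 255, and it clamps each channel to at least 1 to avoid taking the logarithm of zero. The function name `rgb_to_od` and the `background` parameter are hypothetical.

```python
import numpy as np

def rgb_to_od(rgb, background=255.0):
    """Convert RGB values to optical density, OD = -log10(I).

    Assumption: I is the channel value as a fraction of the white
    background level (255 for a 24-bit RGB image).
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    # Clamp to at least 1 so fully dark channels do not give log10(0).
    intensity = np.maximum(rgb, 1.0) / background
    return -np.log10(intensity)
```

Under these assumptions, a pure white pixel maps to an OD of zero in every channel, and the OD grows as more light is absorbed.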
Once the correct vectors are determined, e.g., as described herein, a simple color deconvolution scheme is used to transform the color values into quantitative values of interest:
$$\mathrm{OD} = VS$$
$$S = V^{-1}\,\mathrm{OD}$$
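A minimal sketch of this deconvolution step is given below. With two stains and three color channels, V is not square, so the sketch assumes that the inverse is taken in the least-squares (pseudo-inverse) sense; that choice, the function name `deconvolve`, and the array layout are assumptions made for illustration only.

```python
import numpy as np

def deconvolve(od, stain_vectors):
    """Solve OD = V S for S in the least-squares sense.

    od:            (n_pixels, 3) optical density values
    stain_vectors: (n_stains, 3), one stain vector per row
    returns:       (n_pixels, n_stains) amount of each stain per pixel
    """
    V = np.asarray(stain_vectors, dtype=np.float64).T   # (3, n_stains)
    od = np.asarray(od, dtype=np.float64).T             # (3, n_pixels)
    # lstsq applies the pseudo-inverse of V to every pixel column at once.
    S, *_ = np.linalg.lstsq(V, od, rcond=None)
    return S.T
```

When V happens to be square and invertible (for example, with three stains), the least-squares solution coincides with multiplying by the ordinary inverse.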
Stain Vector Variation and Correction
Slide preparation can vary widely due to different stain manufacturers, different staining procedures, and different storage times. It is assumed that there is a specific stain vector corresponding to each of the two stains present in the image, and that the resulting color (in OD space) of every pixel is a linear combination of these stain vectors. Since there is a non-negative weight on each component, every value generally exists between the two stain vectors. Accordingly, the techniques described herein can locate the fringe of the pixel distribution rather than searching for peaks. If noise were not a factor, the minimum and maximum along the identified direction could be used. Instead, robust versions of the minimum and maximum are used by taking the αth and the (100−α)th percentiles. Empirically, α=1 provides robust results.
The following techniques can be used to identify the particular stain vectors for each image based on the colors that are present. An OD value of zero (0) corresponds to a pixel that is all white and essentially nothing on the slide absorbed any light. For stability reasons, the pixels with nearly no stain (low OD) were thresholded. After empirical analysis, a threshold value of β=0.15 was found to provide robust results while removing a relatively small amount of data. Acceptable results are achieved for a wide range of both α and β. For example, α can range between 0 and 50, where the value 50 would result in the median value.
The shortest path between two unit-norm color vectors on the sphere is the geodesic path. This path appears curved in a spherical coordinate decomposition unless it corresponds to a change in only one of the coordinate directions. By finding this specific geodesic direction, the OD transformed pixels can be projected onto it in order to find the endpoints that correspond to the stain vectors.
The first step in this process is to calculate the plane that the vectors form. This is done by forming a plane from the two vectors corresponding to the two largest singular values of the singular value decomposition (SVD) of the OD transformed pixels. All of these OD transformed pixels are then projected onto this plane, and subsequently normalized to unit length. The projection line is shown to be curved in
The steps are summarized as follows: 1) the RGB slide is converted to the OD; 2) data with an OD intensity of less than β is removed; 3) the SVD of the OD tuples is calculated; 4) a plane is created from the SVD directions corresponding to the two largest singular values; 5) data is projected onto the plane and normalized to unit length; 6) an angle of each point with respect to the first SVD direction is calculated; 7) robust extremes (αth and (100−α)th percentiles) of the angle are identified; and 8) the extreme values are converted to OD space.
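The eight steps above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation of the embodiments: the "OD intensity" in step 2 is taken to be the Euclidean norm of the OD tuple, the background level is assumed to be 255, and the sign of the first SVD direction is flipped if needed so that it points into the non-negative octant. The function name `estimate_stain_vectors` is hypothetical.

```python
import numpy as np

def estimate_stain_vectors(rgb, alpha=1.0, beta=0.15):
    """Estimate two unit-norm stain vectors from the RGB pixels of one slide."""
    rgb = np.asarray(rgb, dtype=np.float64).reshape(-1, 3)
    # 1) Convert RGB to OD (assumed background of 255, clamped to avoid log(0)).
    od = -np.log10(np.maximum(rgb, 1.0) / 255.0)
    # 2) Remove pixels with nearly no stain (assumption: OD norm below beta).
    od = od[np.linalg.norm(od, axis=1) >= beta]
    # 3) SVD of the OD tuples; rows of vt are the right singular vectors.
    _, _, vt = np.linalg.svd(od, full_matrices=False)
    # 4) Plane spanned by the directions of the two largest singular values.
    plane = vt[:2].copy()
    if plane[0].sum() < 0:
        plane[0] = -plane[0]  # resolve the sign ambiguity of the SVD
    # 5) Project onto the plane. Normalizing to unit length does not change
    #    the angle, so it is left implicit here.
    coords = od @ plane.T
    # 6) Angle of each point with respect to the first SVD direction.
    angles = np.arctan2(coords[:, 1], coords[:, 0])
    # 7) Robust extremes: the alpha-th and (100 - alpha)-th percentiles.
    lo, hi = np.percentile(angles, [alpha, 100.0 - alpha])
    # 8) Convert the extreme angles back to unit vectors in OD space.
    ends = np.array([[np.cos(lo), np.sin(lo)],
                     [np.cos(hi), np.sin(hi)]])
    return ends @ plane
```

The function returns a 2×3 array with one estimated stain vector per row. Which row corresponds to which stain is not determined by the procedure itself and would have to be assigned separately, for example by comparing each row with a known reference vector.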
This method was performed on twelve different slides with some variation. Before the use of this method, standard vectors were computed using manual methods to select an area on the slide that only contains one stain and then to calculate an average stain vector from the area to identify vectors that could adequately describe all twelve slides. The results of these computations, along with the standard vectors, are shown in
Intensity Variation and Correction
The intensity of a particular stain depends on the original strength of the stain, how much of it was applied to the tissue during the staining procedure, how much fading (bleaching) has occurred since the sample was originally processed, and finally how much of the cellular substance of interest (e.g., the reactive protein) is present in the material. The last quantity is what we actually want to measure. Removing the confounding factors that degrade the signal is necessary for direct quantitative analysis of these samples.
An assumption can be made that the amount of protein or nucleic acid is a random variable that is scaled by the confounding factors mentioned previously. For each stain in question, the intensity histogram for all pixels that have a majority of that stain is calculated. The 99th percentile of these intensity values is identified and used as a robust approximation of the maximum. This value was shown experimentally to be a good simple descriptor of the histogram by analyzing several patches of each slide; however, other values can be used. All intensity histograms are then scaled to have the same pseudo-maximum and can then be compared with each other.
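The scaling described above can be sketched as follows. This is a minimal illustration under stated assumptions: a pixel is treated as having "a majority" of a stain when that stain has the largest deconvolved amount at the pixel, and the common pseudo-maximum is supplied by the caller. The function name `normalize_stain_intensity` and the `target_max` parameter are hypothetical.

```python
import numpy as np

def normalize_stain_intensity(S, target_max, percentile=99.0):
    """Scale each stain so its robust maximum matches a common reference.

    S:          (n_pixels, n_stains) stain amounts for one slide
    target_max: (n_stains,) pseudo-maximum shared by all slides (assumed given)
    """
    S = np.asarray(S, dtype=np.float64)
    target_max = np.asarray(target_max, dtype=np.float64)
    # Assumption: the "majority" stain of a pixel is its largest component.
    dominant = np.argmax(S, axis=1)
    out = S.copy()
    for k in range(S.shape[1]):
        values = S[dominant == k, k]
        if values.size == 0:
            continue  # no pixel is dominated by this stain; leave it unscaled
        # The 99th percentile serves as a robust approximation of the maximum.
        pseudo_max = np.percentile(values, percentile)
        if pseudo_max > 0:
            out[:, k] *= target_max[k] / pseudo_max
    return out
```

Because each stain is rescaled by a single factor per slide, two slides that differ only in overall stain strength map to the same normalized values under this sketch.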
As can be seen from
Analysis of five slides diagnosed with melanoma and seven slides containing benign nevi (common moles) was performed using a variety of shape and stain-based features. The slides had all been stained with hematoxylin and eosin and scanned at 20×. For each slide, a large number of nuclei were segmented and features were calculated for each of them. The statistical method known as Distance Weighted Discrimination (DWD) (J S Marron, M J Todd, and J Ahn, “Distance weighted discrimination,” in J. of the Am. Statistical Assoc., 2007, vol. 102, pp. 1267-1271) was used to find the optimal separation direction between melanoma and nevi based on this feature-space.
While the examples described herein have been performed using hematoxylin and eosin stained slides of melanomas and nevi, it should be understood that similar techniques are applicable to other histologic stains and tissues. The techniques for obtaining the optimal stain vectors have been evaluated satisfactorily on slides with various stain combinations. When three or more stains are present in a slide, the results are sometimes inconsistent.
The techniques described herein have greatly improved the ability to quantitatively analyze histology slides and have improved the results of our investigations. Automating the process can accommodate larger datasets and enable a level of reproducibility not guaranteed with manual selection methods. The methods presented are easy to implement, and computation time is much improved over the non-negative matrix factorization (NMF) methods (A Rabinovich, S Agarwal, C A Laris, J H Price, and S Belongie, “Unsupervised color decomposition of histologically stained tissue samples,” in Adv. In Neural Inf. Proc. Systems, 2003). Embodiments according to the present invention may be applied to additional research into medical aspects that use stained histology slides for diagnosis, prognosis or basic research, including immunohistochemistry staining of tissue and/or techniques for diagnosing disease in other types of images.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the claims. Therefore, it is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the following claims, with equivalents of the claims to be included therein.
This application claims priority to U.S. Provisional Application Ser. No. 61/269,566, filed Jun. 26, 2009, the disclosure of which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
61269566 | Jun 2009 | US