The disclosed concept relates generally to digital pathology, and in particular, to a system and method for parametrically modelling a dictionary of diagnostically relevant histological patterns and using the modeling scheme to quantify the existence of the patterns in digital pathology images in order to be able to classify the state of a disease in the digital pathology images.
Advances in digital imaging technologies and computational power have now paved the way for a major shift in pathology workflow based on artificial intelligence (AI). It is now feasible to rapidly image microscope slides at clinical volumes, and it is now permissible to use these digital pathology images for real clinical diagnosis given recent regulatory approvals. A critical addition is computational pathology in the form of novel machine learning (ML) tools that pathologists could use to greatly improve their diagnostic performance, especially in terms of accuracy and efficiency. Such tools could also be applied throughout the pathology laboratory for other applications such as case triage or automated real-time quality assurance. Numerous studies have now shown early promise, and it is becoming clear that pathologists and their patients could greatly benefit from access to powerful computational pathology tools.
There is widespread enthusiasm for AI in digital pathology, but this is strongly tempered by caution and reasonable concern about potential risks. Further, almost all early computational pathology attempts have used convolutional neural networks, also known as deep learning. Deep learning is powerful, but it is opaque, like a “black-box” that one cannot open to peer inside and see what it is doing or exactly how it is working, or even whether it is working as intended. Such systems give answers, but do not allow the pathologist to ask “why?”.
In one embodiment, a computational pathology method is provided that is broadly applicable to many histologies, including those relating to high-level and organ-specific disease entities, including both tumor and non-tumor pathology. The method includes receiving multi-parameter cellular and/or sub-cellular imaging data for an image of a tissue sample, and locating and segmenting a plurality of tissue components of the tissue sample in the multi-parameter cellular and sub-cellular imaging data to generate segmented multi-parameter cellular and sub-cellular imaging data. The method further includes applying a parametric feature modelling scheme to certain of the tissue components in the segmented multi-parameter cellular and sub-cellular imaging data, wherein the parametric feature modelling scheme is generated from a dictionary of pre-existing diagnostically relevant histological patterns and comprises a number of structural features adapted for defining a number of disease entities of a disease, and wherein the applying includes determining a quantification of each of the structural features for the tissue sample, and classifying a state of the disease in the tissue sample based on the determined quantification of each of the structural features.
In another embodiment, a computerized computational pathology system for discriminating diagnostic tissue patterns in multi-parameter cellular and sub-cellular imaging data for a number of tissue samples from a number of patients or a number of multicellular in vitro models is provided. Like the method just described, the system is broadly applicable to many histologies, including those relating to high-level and organ-specific disease entities, including both tumor and non-tumor pathology. The system includes a processing apparatus, wherein the processing apparatus includes a number of components configured for: (i) locating and segmenting a plurality of tissue components of the tissue sample in the multi-parameter cellular and sub-cellular imaging data to generate segmented multi-parameter cellular and sub-cellular imaging data; (ii) applying a parametric feature modelling scheme to certain of the tissue components in the segmented multi-parameter cellular and sub-cellular imaging data, wherein the parametric feature modelling scheme is generated from a dictionary of pre-existing diagnostically relevant histological patterns and comprises a number of structural features adapted for defining a number of disease entities of a disease, and wherein the applying includes determining a quantification of each of the structural features for the tissue sample; and (iii) classifying a state of the disease in the tissue sample based on the determined quantification of each of the structural features.
A full understanding of the invention can be gained from the following description of the preferred embodiments when read in conjunction with the accompanying drawings in which:
As used herein, the singular form of “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise.
As used herein, the statement that two or more parts or components are “coupled” shall mean that the parts are joined or operate together either directly or indirectly, i.e., through one or more intermediate parts or components, so long as a link occurs.
As used herein, the term “number” shall mean one or an integer greater than one (i.e., a plurality).
As used herein, the terms “component” and “system” are intended to refer to a computer related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component can be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers. While certain ways of displaying information to users are shown and described with respect to certain figures or graphs as screenshots, those skilled in the relevant art will recognize that various other alternatives can be employed.
As used herein, the term “multi-parameter cellular and sub-cellular imaging data” shall mean data obtained from generating a number of images from a number of sections of tissue which provides information about a plurality of measurable parameters at the cellular and/or sub-cellular level in the sections of tissue. Multi-parameter cellular and sub-cellular imaging data may be created by a number of different imaging modalities, such as, without limitation, any of the following: transmitted light (e.g., a combination of H&E and/or IHC (1 to multiple biomarkers)); fluorescence; immunofluorescence (including but not limited to antibodies and nanobodies, and including but not limited to multiplexed (4-7) biomarkers and hyperplexed (>7) biomarkers); live cell biomarker multiplexing and/or hyperplexing; electron microscopy; toponome imaging; matrix-assisted laser desorption/ionization mass spectrometric imaging (MALDI MSI); complementary spatial imaging (e.g., FISH, MxFISH, FISHSEQ, or CyTOF); multiparameter ion beam imaging; or in vitro imaging. Targets include, without limitation, tissue samples (human and animal) and in vitro models of tissues and organs (human and animal).
Directional phrases used herein, such as, for example and without limitation, top, bottom, left, right, upper, lower, front, back, and derivatives thereof, relate to the orientation of the elements shown in the drawings and are not limiting upon the claims unless expressly recited therein.
The disclosed concept will now be described, for purposes of explanation, in connection with numerous specific details in order to provide a thorough understanding of the subject innovation. It will be evident, however, that the disclosed concept can be practiced without these specific details without departing from the spirit and scope of this innovation.
The disclosed concept provides a novel technological approach to computational pathology, using explainable AI (xAI). The xAI approach of the disclosed concept permits a completely different relationship between the pathologist and the software tool, one that permits transparency and accountability of the ML tool. The intent is to foster pathologist trust and acceptance of new and powerful computational tools. If an AI tool is to credibly support pathologist work on complex and difficult decisions, then it should be able to provide justifications and data for its conclusions. Having full situational awareness, the pathologist is empowered to make the very best diagnostic decisions that only they can make (e.g., benign versus malignant; high-risk versus low-risk; etc.).
Building on prior work in xAI tools of the present inventors, the disclosed concept provides a framework to parametrically model a dictionary of diagnostically relevant histological patterns and quantify their existence in digital pathology images. This framework will let pathologists visualize clinically relevant tissue structures in a quantitative fashion, given a parametric model. In the non-limiting exemplary embodiment, the disclosed concept accesses a potential dictionary of various histological patterns from several different public sources, including from the World Health Organization (WHO) classifications of tumors, and consultations with teams of different sub-specialist pathologist experts. These sources aid in the assembly of a comprehensive framework of high-level and organ-specific disease entities, including both tumor and non-tumor pathology, including breast, lung, gastrointestinal tract, skin, genitourinary tract, etc. These include: inflammatory infiltrate patterns (crypt abscesses in active colitis or presence of abnormal plasmacytoid dendritic cells in dermis), apoptosis in GI biopsies of transplant patients (GVHD is time-intensive), subtle non-tumor patterns in lung pathology, including looking for organism milieu before special stains are available, finding small tumor mets in almost any solid tumor, and finding incidental lymphomas in lymph nodes taken for solid tumor staging (CLL/SLL most commonly).
Many of these disease entities share feature patterns (e.g., gland formation in adenocarcinomas, duct formation in breast tissue), and others are disease-specific, such as colonic crypt distortion that is diagnostic of chronic mucosal ulcerative colitis. Using pre-existing classifications of disease, the approach of the disclosed concept may be generalized throughout multiple organ systems, thereby providing pathologists with relevant and disease-specific diagnostic recommendations that are explainable and justifiable using xAI techniques.
These guidelines would analytically model a visual pattern dictionary that traditionally defines the standards on tumor classification/nomenclature for pathologists worldwide. The models of the disclosed concept, built on WHO guidelines in the exemplary embodiment, will bring a solution to opaqueness when integrated with transparent and interpretable ML interfaces, hence promoting a better understanding of computational tools and tissue mechanisms.
System 5 includes an input apparatus 10 (such as a keyboard), a display 15 (such as an LCD), and a processing apparatus 20. A user is able to provide input into processing apparatus 20 using input apparatus 10, and processing apparatus 20 provides output signals to display 15 to enable display 15 to display information to the user as described in detail herein (e.g., a segmented tissue image and a classification of a current disease state of certain tissue components in the tissue image). Processing apparatus 20 comprises a processor and a memory. The processor may be, for example and without limitation, a microprocessor (µP), a microcontroller, an application specific integrated circuit (ASIC), or some other suitable processing device, that interfaces with the memory. The memory can be any one or more of a variety of types of internal and/or external storage media such as, without limitation, RAM, ROM, EPROM(s), EEPROM(s), FLASH, and the like that provide a storage register, i.e., a machine readable medium, for data storage such as in the fashion of an internal storage area of a computer, and can be volatile memory or nonvolatile memory. The memory has stored therein a number of routines that are executable by the processor, including routines for implementing the disclosed concept as described herein in various embodiments. In particular, processing apparatus 20 includes a histological structure segmentation component 30 configured for identifying and segmenting histological structures (such as, without limitation, ducts/glands and lumen, clusters of ducts/glands, and individual nuclei) in a number of tissue images represented by the multi-parameter cellular and/or sub-cellular imaging data 25 obtained from various imaging modalities as described herein in the various embodiments (e.g., H&E stained image data). In the non-limiting exemplary embodiment, histological structure segmentation component 30 employs the segmentation approach described in the U.S. Provisional Pat. Application No. 62/990,264, titled “Scalable and High Precision Context Guided Segmentation of Histological Structures” and filed on Mar. 16, 2020, the disclosure of which is incorporated herein by reference. That segmentation approach is able to locate and segment histological components, such as, without limitation, ducts, nuclei, blood vessels, lung alveoli, and colon glands.
Processing apparatus 20 further includes a dictionary of pre-existing diagnostically relevant visual histological patterns 35 that traditionally define the standards for classifying the state of a particular disease, such as breast cancer. In the non-limiting exemplary embodiment, the dictionary of pre-existing diagnostically relevant histological patterns is at least partially obtained from World Health Organization (WHO) Blue Books (including images contained therein). As is known in the art, WHO Blue Books are an essential standards reference for pathologists, clinicians and researchers internationally. The WHO Blue Books specify a body of knowledge regarding how histological patterns for differential diagnoses of diseases, such as tumors, can be described structurally with respect to (1) cell morphology (e.g., round, large, mitotic, etc.), (2) spatial cell organization (e.g., picket-fence, cribriform, etc.), and (3) architectural tissue organization (e.g., tumor infiltrating lymph nodes, fat at tumor boundary, etc.). All these histological patterns can be visually assessed by the pathologist, who typically arrives at a diagnosis by relating patterns in tissue samples to the patterns in the WHO standards.
In addition, processing apparatus 20 further includes a number of parametric feature models 40 that define a parametric feature modeling scheme and that are derived from the dictionary of pre-existing diagnostically relevant histological patterns 35. One particular method for creating the number of parametric feature models 40 according to a particular exemplary embodiment is described in connection with
As described elsewhere herein, in the exemplary embodiment, the quantifiable features of the parametric feature modeling scheme may include one or more unary features, wherein each unary feature is a single morphological feature such as the size, shape or spatial spread of tissue components in an image, one or more binary features, wherein each binary feature is a pairwise combination of two unary features, and/or one or more ternary features, wherein each ternary feature is a combination of three or more unary features. Particular examples of such unary, binary and ternary features that may be employed in connection with one or more particular exemplary embodiments of the disclosed concept are described in detail elsewhere herein (
In addition, processing apparatus 20 also includes a tissue classification component 45. As described in detail elsewhere herein, tissue classification component 45 is structured and configured to apply one or more of the parametric feature models 40 to the multi-parameter cellular and/or sub-cellular imaging data 25 in order to classify the state of a particular disease in a tissue sample represented by the multi-parameter cellular and/or sub-cellular imaging data 25.
Referring to
Referring to
Thus, as described in connection with
The disclosed concept will now, for illustrative purposes, be described in connection with one particular exemplary embodiment that is an approach for analyzing and classifying breast lesions. More specifically, this particular exemplary embodiment employs a particular parametric feature modeling scheme that is described in detail below in order to allow for the automatic classification of tumors in the relevant breast lesion images. In this particular exemplary embodiment, step 70 of
The particular embodiment for analyzing and classifying breast lesions described herein invokes a parametric feature model(s) 40 for histological patterns within each segmented duct using a mix of unary, binary, and ternary features as shown in
The unary features of this particular embodiment are shown in
The first group of unary features of this embodiment (the “smallness” and “largeness” features) are based on nuclear size (quantified using area), which is known to provide diagnostic cues in pathological grading, with groups of small and large nuclei having a propensity to belong to low-risk and high-risk lesions, respectively, as shown in
The second group of unary features of this embodiment (the “roundness” and “ellipticity” features) are based on nuclear shape, which has been identified as diagnostically meaningful. For example, as shown in
Moreover, several studies have shown that studying the spatial organization of nuclei provides insights into the abnormalities of cells which might eventually lead to malignancy. For instance, the nuclei arrangement in a CCC lesion frequently exhibits crowding and/or overlapping. However, for cases belonging to high-risk atypical lesions (Flat Epithelial Atypia-FEA and ADH), the nuclei tend to be uniform and evenly spaced. Thus, the third group of unary features of this embodiment (the “crowdedness” and “spacedness” features) are based on the spatial organization of nuclei in an image. To quantify the crowding around each nucleus, its average distance to its ten nearest nuclei is computed. An analytical model of crowdedness is then constructed by considering local ROIs within a duct where clusters of nuclei show significant crowding behavior and then computing their spatial density. In contrast, to capture evenly spaced/uniform dispersion patterns around a nucleus, the disclosed concept starts by placing a regular grid of size 3×3 centered at a reference nucleus, and measures the density of twenty neighboring nuclei by counting the population of nuclei in each grid cell as described in Sergio Rey, Wei Kang, Hu Shao, Levi John Wolf, Mridul Seth, James Gaboardi, and Dani Arribas-Bel, “pointpats: Point Pattern Analysis in PySAL”, PySAL: The Python Spatial Analysis Library, July 2019. The disclosed concept then compares this observed population against the number of nuclei expected under the complete spatial randomness hypothesis, which asserts that points (here, nuclei) occur within the grid cells in a random fashion analogous to a Poisson point process. The comparison uses a χ2-test statistic, and the corresponding p-value is obtained from the χ2 distribution table. The larger the p-value, the greater the likelihood of observing a uniform/evenly spaced dispersion of nuclei around the reference nucleus.
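The two spatial measurements just described can be illustrated with a minimal, standard-library-only sketch. It is not the pointpats implementation cited above. The function names, the choice of a square window sized to the farthest of the twenty neighbors, and the exclusion of the reference nucleus from the cell counts are all assumptions made for illustration. Because a 3×3 grid gives 8 degrees of freedom, the χ2 survival function is evaluated here with the closed form that exists for even degrees of freedom rather than looked up in a table.

```python
import math

def mean_knn_distance(points, idx, k=10):
    """Crowding proxy: average distance from points[idx] to its k nearest nuclei."""
    px, py = points[idx]
    dists = sorted(
        math.hypot(px - qx, py - qy)
        for j, (qx, qy) in enumerate(points) if j != idx
    )
    nearest = dists[:k]
    return sum(nearest) / len(nearest)

def chi2_sf_even_df(x, df):
    """Chi-square survival function; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def spacedness_p_value(points, idx, k=20, grid=3):
    """Quadrat-count test of the k nearest nuclei against complete spatial randomness."""
    cx, cy = points[idx]
    neighbours = sorted(
        (p for j, p in enumerate(points) if j != idx),
        key=lambda p: math.hypot(p[0] - cx, p[1] - cy),
    )[:k]
    # Assumed window: smallest square centred on the reference that holds all neighbours.
    half = max(max(abs(x - cx), abs(y - cy)) for x, y in neighbours)
    if half == 0:
        raise ValueError("all neighbours coincide with the reference nucleus")
    cell = 2 * half / grid
    counts = [0] * (grid * grid)
    for x, y in neighbours:
        col = min(int((x - (cx - half)) / cell), grid - 1)
        row = min(int((y - (cy - half)) / cell), grid - 1)
        counts[row * grid + col] += 1
    expected = len(neighbours) / (grid * grid)   # equal counts per cell under CSR
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return chi2_sf_even_df(stat, grid * grid - 1)
```

On a regular lattice of nuclei this returns a large p-value, while twenty neighbors packed into a single grid cell yield a p-value near zero, matching the interpretation given above.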
Although the unary features described above show some inferential strength (indicated by the hatched bars on top of each feature in
Moreover, some of the diagnostically relevant histological patterns are best represented by a combination of more than two unary features. This embodiment of the disclosed concept, therefore, considers three such ternary features obtained from combinations of more than two unary features (including, but not limited to, the specific unary features described above), which are shown in
In particular, to determine the largeness-roundness-spacedness feature, the disclosed concept takes z-scores from each unary feature, i.e., largeness, roundness and spacedness, and builds a three-component, three-dimensional Gaussian mixture model using ground truth examples.
With respect to the cribriform feature, this pattern is characterized by polarization of epithelial cells within spaces formed by “almost” circular multiple lumen (> 2) which are 5-6 cells wide and whose appearance closely resembles “holes in Swiss cheese”. This complex architectural pattern can be identified by analytically modeling three (unary) sub-features: clustering coefficient, distance of the nucleus from two nearest lumen, and circularity of the lumen adjacent to the nucleus. The polarization of epithelial cells around the lumen is characterized by a clustering coefficient and is computed by following the method described in Naiyun Zhou, Andrey Fedorov, Fiona Fennessy, Ron Kikinis, and Yi Gao, Large Scale Digital Prostate Pathology Image Analysis Combining Feature Extraction and Deep Neural Network, arXiv preprint arXiv: 1705.02678, 2017, and is illustrated in the middle row of
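For reference, the standard form of such a model can be written as follows. The symbols are notation introduced here for illustration and do not appear in the original text: x_f is the raw value of unary feature f for a nucleus, μ_f and σ_f are its mean and standard deviation over the ground truth examples, and π_k, μ_k and Σ_k are the weight, mean vector and covariance matrix of the k-th mixture component.

```latex
\[
z_f = \frac{x_f - \mu_f}{\sigma_f},
\qquad f \in \{\text{largeness},\ \text{roundness},\ \text{spacedness}\}
\]
\[
p(\mathbf{z}) = \sum_{k=1}^{3} \pi_k \,
  \mathcal{N}\!\left(\mathbf{z};\ \boldsymbol{\mu}_k,\ \boldsymbol{\Sigma}_k\right),
\qquad \mathbf{z} = (z_{\text{largeness}},\ z_{\text{roundness}},\ z_{\text{spacedness}})^{\top},
\qquad \sum_{k=1}^{3} \pi_k = 1
\]
```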
With respect to the picket-fence feature, this pattern is recognized from a group of crowded elliptical nuclei oriented perpendicular to the basement membrane (lumen). The analytical model of this high-order visual feature can be obtained by constructing parametric models of four simple (unary) sub-features: distance of a nucleus to nearest lumen, nuclear ellipticity, a spread in the angle of major-axis of 10 nearby nuclei, and its local angle with respect to the basement membrane as shown in the last row of
As discussed above, the parametric models for the histological patterns are, in the exemplary embodiment, probability distributions. For example, a cytological feature like nuclear ellipticity for a given nucleus inside a ROI will receive a probability under the mixture of Gaussian models shown in
With respect to a strategy for differential diagnosis, the disclosed concept, in the exemplary embodiment, adopts a non-linear strategy, similar to what expert pathologists do, in that it finds sub-regions within the ROI by non-maxima suppression (threshold value of 0.85 on the likelihood scores) where the evidence for one or more of the unary, binary, or ternary features is dominant.
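One way such a thresholded non-maxima suppression could be organized is sketched below. The text does not specify how sub-regions are delimited, so the greedy grouping by a fixed `radius`, the `(x, y, score)` tuple layout, and the function name are hypothetical choices made only to show the idea: the highest-scoring nucleus above the threshold seeds a sub-region, nearby qualifying nuclei join it, and all of them are suppressed before the next seed is chosen.

```python
import math

def dominant_subregions(nuclei, threshold=0.85, radius=50.0):
    """Greedy non-maxima suppression over per-nucleus likelihood scores.

    nuclei: list of (x, y, score) tuples for one feature within one ROI.
    radius: hypothetical grouping distance (not specified in the source).
    Returns a list of sub-regions, each a list of indices into `nuclei`.
    """
    order = sorted(range(len(nuclei)), key=lambda i: nuclei[i][2], reverse=True)
    suppressed = set()
    regions = []
    for i in order:
        x, y, score = nuclei[i]
        if score < threshold:
            break                     # remaining scores are lower still
        if i in suppressed:
            continue                  # already absorbed by a stronger seed
        members = [
            j for j in range(len(nuclei))
            if j not in suppressed
            and nuclei[j][2] >= threshold
            and math.hypot(nuclei[j][0] - x, nuclei[j][1] - y) <= radius
        ]
        suppressed.update(members)
        regions.append(members)
    return regions
```

For example, with `radius=10.0`, two high-scoring nuclei one unit apart form a single sub-region, a distant high-scoring nucleus forms its own, and any nucleus scoring below 0.85 is ignored.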
Furthermore, having identified dominant unary, binary and ternary feature regions, the disclosed concept, in the exemplary embodiment, uses three descriptive statistics: median value of the likelihood scores of all the nuclei found in each sub-region, median number of nuclei found in each sub-region and the number of sub-regions. These are calculated for each one of the unary, binary and ternary features (total = 16), thereby obtaining a 48 column feature vector for a single image. In the exemplary embodiment, feature vectors were computed for all 1441 labeled duct ROIs extracted from whole slide images, which resulted in an 834×48 feature matrix used to train a classifier and a 607×48 data matrix for testing. To analyze the benefit of including binary and ternary features, the disclosed concept further slices the 48 column feature vector to be suitable for three scenarios: unary (U) only, unary and binary (U-B), and unary, binary, and ternary features (U-B-T). Due to inherent training and testing class imbalance, which reflects the real-world prevalence statistics of atypical lesions, the disclosed concept up-sampled high-risk examples using the SMOTE technique as described in Chawla N. et al., “Smote: Synthetic Minority Over-sampling Technique”, Journal of Artificial Intelligence Research, 16:321-357, 2002.
In addition, prior to classifying the lesions, the disclosed concept pays close attention to the presence of a cribriform pattern, a symbolic visual primitive of an ADH (a high-risk) category. ROIs predicted to show a cribriform pattern are classified as high-risk if the number of nuclei forming the cribriform sub-region is greater than 8 (a hyper-parameter optimized over the training data). The reduced dataset, devoid of cribriform, is tested for each of the scenarios (U, U-B, and U-B-T) with logistic regression (LR), support vector machine (SVM), random forest (RF), and gradient boosted classifier algorithms. The best model was chosen by optimizing the parameters using GridSearchCV based on precision, recall, and F-scores, and a 10-fold stratified cross-validation was then performed to check for overfitting.
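The arithmetic behind the 48 columns (16 features × 3 statistics) can be made concrete with a short sketch. The pooling of likelihood scores across all sub-regions before taking the median is one possible reading of the description and is an assumption, as are the zero-fill for features with no dominant sub-region, the function names, and the placeholder feature names.

```python
from statistics import median

def feature_stats(subregions):
    """Three descriptive statistics for one feature in one ROI.

    subregions: list of non-empty lists of likelihood scores, one list per sub-region.
    Returns [median score, median nuclei per sub-region, number of sub-regions].
    """
    if not subregions:
        return [0.0, 0.0, 0.0]        # assumed fill value when no region is dominant
    pooled = [s for region in subregions for s in region]
    sizes = [len(region) for region in subregions]
    return [float(median(pooled)), float(median(sizes)), float(len(subregions))]

def image_feature_vector(per_feature_subregions, feature_names):
    """Concatenate the three statistics for every feature, in a fixed order."""
    vector = []
    for name in feature_names:
        vector.extend(feature_stats(per_feature_subregions.get(name, [])))
    return vector

# Placeholder names standing in for the 16 unary, binary and ternary features.
FEATURE_NAMES = [f"feature_{i:02d}" for i in range(16)]
```

Calling `image_feature_vector({"feature_00": [[0.9, 0.95], [0.88]]}, FEATURE_NAMES)` gives a 48-element list whose first three entries are `0.9`, `1.5` and `2.0`. Stacking one such vector per ROI produces matrices of the shape described above.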
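The two-stage decision described here, a rule on the cribriform pattern followed by a trained classifier for everything else, can be sketched as below. The function and parameter names are hypothetical, and `fallback_classifier` is a stand-in for whichever tuned model (LR, SVM, RF or gradient boosting) is selected; the sketch shows only the control flow, not the model selection.

```python
def classify_roi(cribriform_nuclei_count, feature_vector,
                 fallback_classifier, min_cribriform_nuclei=8):
    """Rule-first classification of a duct ROI.

    cribriform_nuclei_count: nuclei in the detected cribriform sub-region (0 if none).
    fallback_classifier: callable mapping a feature vector to a label; a stand-in
        for the trained model applied to ROIs without a qualifying cribriform pattern.
    min_cribriform_nuclei: threshold from the text (strictly greater than 8).
    """
    if cribriform_nuclei_count > min_cribriform_nuclei:
        return "high-risk"            # decided by the cribriform rule alone
    return fallback_classifier(feature_vector)
```

Keeping the rule outside the learned model means a high-risk call of this kind can be traced directly to the cribriform evidence, which is in keeping with the explainability goal stated earlier.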
The approach of this particular embodiment of the disclosed concept as just described, with ~150 parameters (see
While specific embodiments of the invention have been described in detail, it will be appreciated by those skilled in the art that various modifications and alternatives to those details could be developed in light of the overall teachings of the disclosure. Accordingly, the particular arrangements disclosed are meant to be illustrative only and not limiting as to the scope of the disclosed concept, which is to be given the full breadth of the claims appended and any and all equivalents thereof.
This invention was made with government support under grant # CA204826 awarded by the National Institutes of Health (NIH). The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2021/041037 | 7/9/2021 | WO | |

Number | Date | Country |
---|---|---|
63049690 | Jul 2020 | US |