The present application relates to a computer-implemented method for identifying at least one class of at least one biological image, notably to predict the genomic signature from biological image(s), in particular to predict Homologous Recombination DNA-repair deficiency (HRD) from biological images of tissues. The present application further proposes a computer-implemented method for visualizing clusters of sub-images or tiles of at least one biological image, in particular to predict the phenotypic feature or combination of phenotypic features (or phenotypic patterns) associated with the genomic signature.
Homologous Recombination DNA-repair deficiency (HRD) is a well-recognized predictive marker for platinum-salt and PARP inhibitor therapies in ovarian cancer and is under evaluation in clinical trials in breast cancers (BC). Causing high genomic instability, HRD is currently determined by BRCA1/2 sequencing or by genomic signatures, but its morphological manifestation is not well understood. Deep Learning is a powerful machine learning technique that has recently been shown to be capable of predicting genomic signatures from stained tissue slides. Here, we train a deep-learning model to predict HRD in a controlled cohort of luminal BC (AUC: 0.83). We present and evaluate a strategy to control for imaging biases in retrospective cohorts and we develop a new visualization technique that allows the morphological features related to HRD to be extracted automatically. The extracted morphological patterns have been analysed in detail, leading to an improved understanding of the phenotypic impact of HRD.
The importance of correcting biases when predicting Homologous Recombination Deficiency (HRD) in breast cancers from Hematoxylin Eosin slides using deep learning is herein demonstrated. The novel interpretation algorithm leads to results illustrative of disease-relevant genotype-phenotype relationships, thus identifying morphological patterns related to HRD and shedding light on its phenotypic consequences.
The advent of Deep Learning has revolutionized biomedical image analysis and in particular digital pathology. Traditionally, the majority of methods developed in this field were dedicated to computer-aided diagnosis, where the objective is to partially automatize human interpretation of slides, in order to help pathologists in their diagnosis task, e.g. the detection of mitoses 1, or the identification of metastatic axillary lymph nodes 2,3. Beyond the automatization of manual inspection, Deep Learning has also been successfully applied to predict patient variables, such as outcome 4, and to predict molecular features, such as gene mutations 5,6, expression levels 7 or genetic signatures 5,8. Despite these results of unprecedented quality, one of the major drawbacks of Deep Learning algorithms is their black-box character: because the features are automatically extracted, it is difficult to know how a decision was made. This has two major consequences: first, it is difficult to identify potential confounders, i.e. variables that correlate with the output due to the composition of the data set and that are predicted instead of the intended output variable. Second, even in the absence of statistical artifacts, understanding how the decision was generated in the first place can point to interesting mechanistic hypotheses and to patterns in the image that have so far been overlooked. One way to overcome the latter problem is to use hand-crafted biologically meaningful features 8. This however requires an extraordinary effort in terms of annotation. Here, we take a conceptually different approach. Instead of working in a pan-cancer setting on a large number of signatures, we concentrate on one single medically highly relevant signature in one cancer type on a controlled data set, where we can correct for potential biases. 
In order to understand how the Deep Learning decision is generated and which morphological patterns are related to the output variable, we propose a novel visualization and interpretation technique that paves the way to “machine teaching”, i.e. a data-driven approach to identify phenotypic patterns related to genomic signatures, thus pointing to new mechanistic hypotheses. In order to demonstrate the power of this strategy, we focus on predicting Homologous Recombination Deficiency (HRD) in Breast Cancer (BC). Worldwide, 2.1 million women are newly diagnosed per year with BC, which is a leading cause of cancer-related death. Improvement of metastatic breast cancer treatment is therefore of highest priority. BC is a heterogeneous disease with four major molecular classes (luminal A, B, HER2 enriched, and triple-negative breast cancer [TNBC]) benefiting from different therapeutic approaches. Whereas early BC patients have an overall survival of 70 to 80%, metastatic disease is incurable, with a short duration of survival 9. Homologous Recombination (HR) is a major and high-fidelity repair pathway of DNA double-strand breaks. Its deficiency, HRD, results in high genomic instability 10 and occurs through diverse mechanisms, including germline or acquired mutations in DNA repair genes, most frequently BRCA1, BRCA2 or PALB2, or through epigenetic alterations of BRCA1 or RAD51C. Importantly, HRD leads to a high sensitivity to polyADP-ribose polymerase inhibitors (PARPi) in vitro 11,12. PARPi have been shown to improve progression-free survival in metastatic breast cancer 13. Several methods have been developed to detect HRD, including genomic instability profiling, mutational signatures, or integrating structural and mutational signatures 14-18. However, HRD is currently diagnosed in clinical practice by sequencing of DNA repair genes and by genomic instability patterns (genomic scar) such as the LST signature 14 or the HRD MyChoice® CDx test (Myriad Genetics).
BRCA1 and BRCA2 mutations are known predictive markers for response to PARPi 10 and platinum salt 19 and the somatic HRD has been more recently recognized as a predictive marker for PARPi in ovarian 10 and breast cancer 20. But neither a specific routinely assessed phenotype nor a morphological pattern indicates the presence of HRD. The majority of hereditary BRCA1 cancers are TNBC and up to 60-69% of sporadic TNBC harbor a genomic profile of HRD (Alexandrov et al. 2013; Popova et al. 2012; Chopra et al. 2020). However, the majority of hereditary BRCA2 cancers are luminal (Lakhani et al. 2002) and HRD also exists in sporadic luminal B (Manié et al. 2016; Chopra et al. 2020) or in HER2 tumors (Ferrari et al. 2016; Turner 2017). Of note, germline or sporadic alterations of BRCA harbor indistinguishable genomic alterations in triple-negative or in luminal tumors 21,22. In that context, a reliable and accurate test is mandatory to select patients for PARPi and platinum salt treatments. Whereas the screening for germline or somatic BRCA1 and BRCA2 mutations is feasible for all TNBC (18% of all BC cases), it represents a real challenge in clinical practice if extended to all luminal B tumors (35% of all BC). This strategy moreover does not identify the whole diversity of genetic causes of HRD. In this study, we present an image-based approach to predict HR status from Whole Slide Images (WSI) stained with Hematoxylin Eosin (HE) using deep learning, from a large retrospective series of luminal and triple-negative breast carcinomas with a genomically defined HR status, from a single cancer center. In particular, we show that careful correction for potential biases is essential for such studies and demonstrate the relevance of a correction strategy. Finally, we develop a novel interpretation algorithm that allows the visualization of decisive patterns and leads to new hypotheses on disease-relevant genotype-phenotype relationships.
The principle of the invention is notably illustrated in the
A cancer is a disease involving abnormal cell growth with the potential to invade or spread to other parts of the body. According to the invention, the cancer which affects or affected a patient may be selected from the list consisting of bladder cancer, bone cancer, brain cancer, breast cancer, cervical cancer, colon cancer, esophageal cancer, gastric cancer, head & neck cancers, Hodgkin's lymphoma, leukemia, liver cancer, lung cancer, melanoma, mesothelioma, multiple myeloma, myelodysplastic syndrome, non-Hodgkin's lymphoma, ovarian cancer, pancreatic cancer, prostate cancer, rectal cancer, renal cancer, sarcoma, skin cancer, testicular cancer, thyroid cancer or uterine cancer. In a particular embodiment, the cancer which affects or affected a patient is a breast cancer, including breast cancer corresponding to ductal carcinoma, lobular carcinoma, invasive breast cancer, inflammatory breast cancer, metastatic breast cancer, hormone receptor positive breast cancer, hormone receptor negative breast cancer, HER2 positive breast cancer, HER2 negative breast cancer, or triple-negative breast cancer. In a more particular embodiment of the invention, the cancer which affects or affected a patient is a triple-negative breast cancer. Triple-negative breast cancer (TNBC) is cancer that tests negative for estrogen receptors, progesterone receptors, and excess HER2 protein. Thus, triple-negative breast cancer does not respond to hormonal therapy medicines or medicines that target HER2 protein receptors.
A biological marker (or biomarker) is defined as a biochemical, molecular, or cellular alteration that is measurable in biological media such as tissues, cells, or fluids, and that indicates a normal or abnormal process of a condition or disease. The term “biomarker” refers to a molecule which can be measured accurately and reproducibly, thereby leading to the provision of a “signature” that is objectively measured and evaluated as an indicator of normal biological processes, or pathogenic processes, or pharmacologic responses. In the context of the present invention, a biomarker corresponds to biological molecule(s) expressed by and/or present within cells of a human being. Thus, in the present invention biological markers include genetic biomarkers (corresponding to the transcript products of genes) and epigenetic biomarkers (corresponding to methylation of DNA, for example). In the present invention, biomarkers include DNA, RNA and proteins. The measure of the expression of the biomarkers leads to the provision of a signature that can be associated with the detection of cancer cells.
A biological sample obtained from the patient can be any biological sample, such as tissue, blood, urine, or whole cell lysate. Methods of obtaining a biological sample are well known in the art and include obtaining samples from surgically excised tissue. Tissue, blood, urine and cellular samples can also be obtained without the need for invasive surgery, for example by puncturing the subject with a fine needle and withdrawing cellular material or by biopsy. In certain embodiments, samples taken from a patient can be treated or processed to obtain processed biological samples such as supernatant, whole cell lysate, or fractions or extract from cells obtained directly from the patient. In other embodiments, biological samples issued from a patient can also be used with no further treatment or processing. In a preferred embodiment, the biological sample obtained from the subject is a tissue, in particular a tissue from a tumor or a tumor extract, obtained by biopsy or by surgical excision. A biological sample issued from a subject may, for example, be a sample removed or collected or susceptible of being removed or collected from an internal organ or tissue or tumor of said subject, in particular from a tumor, or a biological fluid from said subject such as the blood, serum, plasma or urine. A biological sample collected or removed from the subject may, for example, be a sample comprising cancer cells which have been or are susceptible of being removed or collected from a tissue, in particular a tumor, of said subject.
A primary cancer develops at the anatomical site where tumor progression began and proceeded to yield a cancerous mass. Most cancers develop at their primary site but then go on to metastasize: cancer cells from the primary cancer spread to other parts of the body and form new, or secondary, tumors, leading to a metastatic cancer. These secondary tumors are the same type of cancer as the primary cancer, also called the primary tumor. Most cancers continue to be named after their primary site, as in breast cancer or lung cancer for example, even after they have spread to other parts of the body.
A tumor is an abnormal mass of tissue that forms when cells grow and divide more than they should or do not die when they should. Tumors may be benign (not cancer) or malignant (cancer). Benign tumors may grow large but do not spread into, or invade, nearby tissues or other parts of the body. Malignant tumors can spread into, or invade, nearby tissues. They can also spread to other parts of the body through the blood and lymph systems.
Histopathology is a branch of pathology which deals with the study of disease in a tissue section. It may refer to the examination of a biopsy or a surgical specimen after the specimen has been processed and histological sections have been placed onto an appropriate support medium.
A genomic signature or profile (or gene signature or gene expression signature or profile) is a single or combined group of genes in a cell with a uniquely characteristic pattern of gene expression that occurs as a result of an altered or unaltered biological process or pathogenic medical condition.
Homologous recombination (HR) is a type of genetic recombination in which genetic information is exchanged between two similar or identical molecules of double-stranded or single-stranded nucleic acids (usually DNA as in cellular organisms but may be also RNA in viruses). It is widely used by cells to accurately repair harmful breaks that occur on both strands of DNA, known as double-strand breaks (DSB), in a process called homologous recombinational repair (HRR).
Homologous recombination deficiency (HRD) is a phenotype that is characterized by the inability of a cell to effectively repair DNA double-strand breaks using the homologous recombination repair (HRR) pathway. Loss of function of genes involved in this pathway can sensitize tumors to particular treatments which target the destruction of cancer cells, for example by working in concert with HRD through synthetic lethality.
Homologous recombination proficiency corresponds to a sample exhibiting a normal or near normal level of homologous recombination DNA repair activity.
Homologous recombination (HR) status of the cancer tissue corresponds to the classification of cancer into the group of homologous recombination deficient (HRD) or non-HR deficient (non HRD) (or HR proficient (HRP)).
A Large-scale State Transition (LST) corresponds to a chromosomal breakage that generates 10 Mb or larger fragments. The quantification of these breaks can be used as a surrogate measure for genomic instability, which may be caused by mutation of DNA repair genes, including BRCA1 or BRCA2.
A molecular subtype or class of cancer is based on the genes the cancer cells express. These genes control how the cells behave. Different cancers of a single organ may behave and grow in different ways. Defining a cancer at the molecular level makes it possible to further classify cancers according to their pattern and behavior instead of their origin. As an example, breast cancer has four primary molecular subtypes, defined in large part by hormone receptors (HR) and other types of proteins involved (or not involved) in each cancer: a) Luminal A or HR+/HER2− (HR-positive/HER2-negative); b) Luminal B or HR+/HER2+ (HR-positive/HER2-positive); c) Triple-negative or HR−/HER2− (HR-negative/HER2-negative); and d) HER2-positive. A fifth subtype, known as normal-like breast cancer, closely resembles luminal A.
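As a purely illustrative sketch of how such breaks might be counted, the Python snippet below takes copy-number segments for a single chromosome arm and counts the transitions between adjacent segments that are both at least 10 Mb long. The segment representation, the name `count_lst` and the rule that a break is any change in copy number are assumptions made for illustration; published LST scoring involves additional filtering and smoothing steps that are not shown here.

```python
from typing import List, Tuple

# Hypothetical segment format: (start_bp, end_bp, copy_number) on one chromosome arm.
Segment = Tuple[int, int, int]

MIN_FRAGMENT_BP = 10_000_000  # 10 Mb threshold taken from the definition above


def count_lst(segments: List[Segment], min_size: int = MIN_FRAGMENT_BP) -> int:
    """Count breaks between adjacent segments that are both >= min_size."""
    ordered = sorted(segments)
    breaks = 0
    for (s1, e1, cn1), (s2, e2, cn2) in zip(ordered, ordered[1:]):
        both_large = (e1 - s1) >= min_size and (e2 - s2) >= min_size
        # Assumption: a "break" is any change in copy number between neighbours.
        if both_large and cn1 != cn2:
            breaks += 1
    return breaks


arm = [
    (0, 15_000_000, 2),
    (15_000_000, 30_000_000, 3),  # counted: both neighbours are large, CN changes
    (30_000_000, 32_000_000, 1),  # 2 Mb segment: its flanking breaks are ignored
    (32_000_000, 50_000_000, 3),
]
print(count_lst(arm))  # prints 1
```

Summing such counts over all chromosome arms would give a per-tumor score that can be thresholded into "LST high" and "LST low"; the threshold itself is not specified here.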
A cancer's grade describes how abnormal the cancer cells and tissue look when compared to healthy cells. Cancer cells that look and organize most like healthy cells and tissue are low grade tumors. Some cancers have their own system for grading tumors. Many others use a standard 1-4 grading scale.
A cancer's stage describes how large the primary tumor is and how far the cancer has spread in the patient's body. There are several different staging systems. Many of these have been created for specific kinds of cancers. Others can be used to describe several types of cancer.
Stage 0 to stage IV: one common system that many people are aware of puts cancer on a scale of 0 to IV.
Stage 0 is for abnormal cells that haven't spread and are not considered cancer, though they could become cancerous in the future. This stage is also called “in-situ.”
Stage I through Stage III are for cancers that haven't spread beyond the primary tumor site or have only spread to nearby tissue. The higher the stage number, the larger the tumor and the more it has spread.
Stage IV cancer has spread to distant areas of the body.
A sporadic cancer is a cancer that occurs in people who do not have a family history of that cancer or an inherited change in their DNA that would increase their risk for that cancer.
A germline cancer occurs when cancer is related to a mutation inherited from a parent. Germline mutations, also called hereditary mutations, are passed on from parents to offspring. Inherited germline mutations play an important role in cancer risk and susceptibility.
Major molecular subtypes of breast cancers are summarized in the table below (issued from Eliyatkin N. et al., J Breast Health. 2015 Apr. 1; 11 (2): 59-66. doi: 10.5152/tjbh.2015.1669).
Tumor-infiltrating lymphocytes (TILs) are white blood cells that have left the bloodstream and migrated towards a tumor. They include T cells and B cells and are part of the larger category of ‘tumor-infiltrating immune cells’, which consist of both mononuclear and polymorphonuclear immune cells (i.e., T cells, B cells, natural killer cells, macrophages, neutrophils, dendritic cells, mast cells, eosinophils, basophils, etc.) in variable proportions. Their abundance varies with tumor type and stage and in some cases relates to disease prognosis.
Necrosis is a form of cell injury which results in the premature death of cells in living tissue by autolysis.
Anisokaryosis corresponds to an inequality in the size of the nuclei of cells.
In an embodiment, the invention concerns a computer-implemented method for identifying at least one class, optionally a biological class, of at least one biological image, comprising the following steps:
In a preferred embodiment, an optional step of selecting at least some of the tiles from the set of tiles, for example by removing the background tiles is present, for example between the first step of dividing and the first step of encoding.
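The dividing and selecting steps can be sketched as follows. This is a minimal illustration on a toy grayscale image held in nested Python lists; the function names, the non-overlapping tiling scheme and the mean-intensity threshold used to flag background are all assumptions, not the specific procedure of the invention.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixel rows, 0 (dark) to 255 (white)


def divide_into_tiles(image: Image, tile: int) -> List[Tuple[int, int, Image]]:
    """Split an image into non-overlapping tile x tile patches.

    Returns (row, col, patch) triples; incomplete edge patches are dropped.
    """
    height, width = len(image), len(image[0])
    tiles = []
    for r in range(0, height - tile + 1, tile):
        for c in range(0, width - tile + 1, tile):
            patch = [row[c:c + tile] for row in image[r:r + tile]]
            tiles.append((r, c, patch))
    return tiles


def is_background(patch: Image, white_level: int = 220) -> bool:
    """Assumed rule: a patch is background when its mean intensity is near white."""
    pixels = [p for row in patch for p in row]
    return sum(pixels) / len(pixels) >= white_level


def select_tissue_tiles(image: Image, tile: int, white_level: int = 220):
    """Keep only the tiles that are not flagged as background."""
    return [(r, c, patch) for r, c, patch in divide_into_tiles(image, tile)
            if not is_background(patch, white_level)]


# Toy 4x4 image: the left half is dark "tissue", the right half is white glass.
toy = [[80, 90, 250, 255],
       [85, 95, 252, 254],
       [70, 60, 251, 253],
       [75, 65, 255, 250]]
kept = select_tissue_tiles(toy, tile=2)
print([(r, c) for r, c, _ in kept])  # prints [(0, 0), (2, 0)]
```

In practice the threshold would need tuning to the stain and scanner, and more robust tissue detectors exist; the point is only that selection happens after dividing and before encoding.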
In a preferred embodiment, the class is the genomic signature or profile of the cancer, or a molecular class of cancer, in particular selected from triple-negative breast cancer or luminal breast cancer, or the class is selected from the cancer's grade, or from the gBRCA1/2 status, in particular sporadic or germline cancer, or from the homologous recombination status of a cancer, in particular breast cancer.
In an embodiment of the invention, it is provided a computer-implemented method for classifying an image comprising the following steps:
In a preferred embodiment, a step of selecting at least some of the tiles from the set of tiles, for example by removing the background tiles is present, for example between the first step of dividing and the first step of encoding.
In a preferred embodiment, the pre-trained model of the encoding step is trained using a self-supervised algorithm, for example using a momentum contrast method.
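For orientation, momentum contrast trains a query encoder by gradient descent while a key encoder follows it as an exponential moving average, and uses a contrastive loss that pulls a query towards the key of the same image and away from keys of other images. The sketch below shows these two ingredients on plain Python lists; the function names and the values of `m` and `temperature` are illustrative assumptions, and a real implementation would operate on network weights and batches of tile embeddings.

```python
import math
from typing import List, Sequence


def momentum_update(key_params: Sequence[float], query_params: Sequence[float],
                    m: float = 0.999) -> List[float]:
    """Key encoder parameters follow the query encoder as a moving average."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(query: Sequence[float], positive_key: Sequence[float],
                  negative_keys: Sequence[Sequence[float]],
                  temperature: float = 0.07) -> float:
    """Contrastive loss: low when the query matches its positive key only."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negative_keys]
    # Numerically stable negative log-softmax of the positive logit (index 0).
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[0]


q = [1.0, 0.0]
aligned = info_nce_loss(q, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce_loss(q, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(aligned < misaligned)  # the matching pair gives the smaller loss
print(momentum_update([0.0, 1.0], [1.0, 1.0], m=0.9))
```

Because no labels enter the loss, such pre-training can use tiles from slides whose HR status is unknown.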
In a preferred embodiment of the invention, it is provided a method wherein the biological class of the biological image of a cancer tissue obtained from a subject is identified, optionally wherein the class is the genomic signature or profile of the cancer tissue, optionally wherein the class is the homologous recombination (HR) status of the cancer tissue (i.e., homologous recombination deficient (HRD), or non-HR deficient (non-HRD) or HR proficient (HRP)), the molecular class and/or the molecular grade, optionally wherein the cancer is breast cancer.
In a preferred embodiment of the invention, it is provided a method wherein the biological class is the genomic tumor (or cancer) profile, notably the Homologous Recombination Deficient (HRD) profile, in particular defined by the presence of a germline BRCA1/2 (gBRCA1/2) mutation or assessed by the Large-scale State Transitions (LST) genomic signature (or LST high) according to Popova et al. (14), or the Homologous Recombination Proficient (HRP) profile, in particular defined as LST low.
In a preferred embodiment of the invention, it is provided a method wherein the neural network is specifically pre-trained on a set of images or sub-images, optionally on a set of images, preferably whole slide images, of a cancer tissue obtained from one or more subjects to classify slide representations between HRD and non-HRD, optionally between HRD and HRP, to the individual tile representations.
In a preferred embodiment of the invention, it is provided a method wherein the images or sub-images are of known class, optionally of known genomic status, optionally of known HR status (HRD or non-HRD).
In a preferred embodiment of the invention, it is provided a method wherein when training at least one of the aforementioned models, at least one bias is corrected, for example a bias related to the technique for obtaining the slide represented by said image, for example the fixing technique and/or the impregnation technique, and/or a bias related to a molecular subtype or a molecular class of cancer.
In an embodiment of the invention, it is provided a computer-implemented method for visualizing clusters of sub-images or tiles of at least one biological image, comprising the following steps:
In a preferred embodiment, it is provided a computer-implemented method for visualizing clusters of sub-images or tiles of at least one biological image, comprising the following steps:
In a preferred embodiment, it is provided a method which further comprises the following steps:
The present invention also concerns a computer-implemented method for identifying a phenotypical feature, or a combination of phenotypical features or phenotypical pattern in a biological image from a subject, wherein said image is examined for assessing the presence of said phenotypical feature or combination of phenotypical features or phenotypical pattern(s) as defined at the step of labelling of the method, and optionally wherein the phenotypical feature is a histopathological feature.
In a preferred embodiment, it is provided a method wherein the biological image is a whole slide image (WSI), or a portion thereof, for example a tile derived from a WSI.
In a preferred embodiment, it is provided a method wherein the image is a visual representation of a body part obtained using a medical imaging technology such as radiology, magnetic resonance imaging, ultrasound, endoscopy, elastography, tactile imaging, thermography, medical photography, or nuclear medicine functional imaging techniques such as positron emission tomography (PET) and single-photon emission computed tomography (SPECT).
In a preferred embodiment, it is provided a method wherein the image is an image obtained from a tissue of a subject, notably a whole slide image obtained from a tissue of a subject, or an image of a (histo)pathology section, notably digitized image of (histo)pathology section.
In a preferred embodiment, it is provided a method wherein the tissue is a cancer, or tumor, tissue.
In a preferred embodiment, it is provided a method wherein the tissue is derived from a biopsy obtained from the subject, for example a cancer or tumor biopsy, notably biopsy obtained from a needle biopsy, an endoscopic biopsy, or a surgical biopsy.
In a preferred embodiment, it is provided a method wherein the cancer or tumor is selected from cancers or tumors deficient in homologous recombination (HRD).
In a preferred embodiment, it is provided a method wherein the cancer is selected from breast cancers, ovarian cancers, liver cancers, esophageal cancers, lung cancers, head and neck cancers, prostate cancers, colon, rectal, or colorectal cancers, and pancreatic cancers, preferably breast cancers, ovarian cancers, pancreatic cancers and prostatic cancers.
In a preferred embodiment, it is provided a method wherein the cancer or tumor is a primary or a metastatic cancer or tumor, notably wherein the cancer or tumor is primary ovarian or breast cancer or metastatic pancreatic or prostatic cancer.
In a preferred embodiment, it is provided a method wherein the breast cancer is a luminal (luminal A or luminal B) breast cancer, a triple-negative/basal-like breast cancer (TNBC), a HER2-enriched breast cancer, or a normal-like breast cancer, preferably the breast cancer is a luminal A or luminal B breast cancer.
In a preferred embodiment, it is provided a method wherein the training set of images or sub-images is obtained from a set of biological images, optionally from one or more subjects, optionally of one type of cancer, optionally of one molecular type of cancer (notably of luminal breast cancers), optionally of the same type of tissue or biopsy (notably of breast cancer biopsies).
In a preferred embodiment, it is provided a method wherein the training set of images is stratified into subgroups according to various technical features, including, in a non-limiting manner, the type of image (preferably whole slide images), the type of staining and the type of tissue fixation, and/or biological features, including, in a non-limiting manner, the sex of the subject, the age of the subject, the type of cancer, notably the molecular subtype of cancer, and the nature of the cancer (e.g., primary or metastatic cancer).
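As a hedged illustration of such stratification, the sketch below groups hypothetical slide records by one or more features and reports the proportion of HRD slides in each subgroup. The record fields, the values `fixative_A` and `fixative_B`, and the helper names are invented for the example; a large difference in HRD proportion between subgroups would suggest that the grouping feature may act as a confounder.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical slide records; field names and values are illustrative only.
slides = [
    {"id": "s1", "fixation": "fixative_A", "subtype": "luminal", "hr_status": "HRD"},
    {"id": "s2", "fixation": "fixative_A", "subtype": "luminal", "hr_status": "HRP"},
    {"id": "s3", "fixation": "fixative_B", "subtype": "luminal", "hr_status": "HRD"},
    {"id": "s4", "fixation": "fixative_B", "subtype": "luminal", "hr_status": "HRD"},
    {"id": "s5", "fixation": "fixative_B", "subtype": "TNBC", "hr_status": "HRD"},
    {"id": "s6", "fixation": "fixative_A", "subtype": "TNBC", "hr_status": "HRP"},
]


def stratify(records: List[dict], keys: Sequence[str]) -> Dict[Tuple[str, ...], List[dict]]:
    """Group records into subgroups sharing the same values for `keys`."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec)
    return dict(groups)


def hrd_fraction(records: List[dict]) -> float:
    """Proportion of records in a subgroup that are labelled HRD."""
    return sum(r["hr_status"] == "HRD" for r in records) / len(records)


by_fixation = stratify(slides, ["fixation"])
for group, members in sorted(by_fixation.items()):
    print(group, len(members), round(hrd_fraction(members), 2))
```

Passing several keys, for example `["fixation", "subtype"]`, yields finer subgroups, at the cost of fewer slides in each.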
In a preferred embodiment, it is provided a method wherein when training the neural network, confounding effect(s), associated with one or more technical features and/or with one or more biological features of the (training) set of images are assessed according to the method illustrated in
In a preferred embodiment, it is provided a method wherein sampling of the training set of images or of the set of tiles is performed before the training of the neural network.
In a preferred embodiment, it is provided a method wherein subgroups of images are selected for specific training of the neural network, optionally wherein the images are whole slide images from stained histopathological section of luminal and triple-negative breast cancers, preferably of luminal breast cancer, optionally wherein the histological sections are stained with Hematoxylin Eosin (HE).
The present invention also concerns a method for identifying the cancer class of an image from a subject comprising the following steps:
In a preferred embodiment, a step of selecting at least some of the tiles from the set of tiles, for example by removing the background tiles is present, for example between the dividing step and the encoding step.
The present invention also concerns a method of stratifying, or classifying a patient comprising the following steps:
In a preferred embodiment, a step of selecting at least some of the tiles from the set of tiles, for example by removing the background tiles, is present, for example between the dividing step and the encoding step.
In a preferred embodiment, the WSI are obtained from fixed HE-stained histological sections.
The present invention also concerns an ex vivo method for classifying a patient having a cancer, in particular a breast cancer, according to its homologous recombination status, comprising identification in a tissue section, preferably stained and more preferably HE stained, of a cancer biopsy or of a digitized image thereof, such as a WSI, of one or more of the following histopathological features:
Identifying one or more of these features, preferably at least 2, 3, 4, 5 or 6 of these features, in the tissue section of the cancer biopsy or in the image thereof is indicative of an HRD cancer or an HRP cancer, depending on the histopathological features. These features may be analysed according to the methods and results illustrated in the working examples of the invention, in particular in examples 3 and
These histopathological features are known to the skilled artisan, for example a histopathologist, and each of these features may be characterized by the skilled artisan according to methods known in the art.
Assessing one or more, more preferably all, of the above-detailed histopathological features may be performed to carry out the following methods:
The present invention also concerns an ex vivo method for classifying cancers according to their HR status comprising identification in a tissue section, preferably stained and more preferably HE stained, of a cancer biopsy or of a digitized image thereof, such as a WSI, of one or more of the following histopathological features:
These histopathological features are known to the skilled artisan, for example a histopathologist, and each of these features may be characterized by the skilled artisan according to methods known in the art.
In a particular embodiment, the patient suffers from a breast cancer.
The present invention also concerns a method of treating a patient suffering from a cancer comprising the steps of:
In a particular embodiment, the patient suffers from a breast cancer.
In a preferred embodiment the method for treating a patient further comprises:
The present invention also concerns a method of predicting patient eligibility to a cancer treatment comprising the steps of:
In a particular embodiment, the patient has a breast cancer.
In a preferred embodiment, DNA damaging agents include, without limitation, inhibitors of poly ADP ribose polymerase, platinum-based chemotherapy drugs (e.g., cisplatin, carboplatin, oxaliplatin, and picoplatin), anthracyclines (e.g., epirubicin and doxorubicin), topoisomerase I inhibitors (e.g., camptothecin, topotecan, and irinotecan), DNA crosslinkers such as mitomycin C, and triazene compounds (e.g., dacarbazine and temozolomide).
In a preferred embodiment, synthetic lethality therapeutic approaches typically involve administering an agent that inhibits at least one critical component of a biological pathway that is especially important to a particular tumor cell's survival, in particular PARP inhibitors.
The present invention also concerns a method for determining the prognosis of a patient suffering from a cancer comprising the steps of:
In a particular embodiment, the patient suffers from a breast cancer.
In the present application, we set out to predict the HR status in breast cancer from H&E stained WSI and to analyse the phenotypic patterns related to HRD. The prediction of HRD is an important challenge in clinical practice. The use of PARP inhibitors for breast cancer patients was initiated for metastatic TNBC patients with germline mutations of BRCA1 or BRCA2. However, BRCA2, as well as PALB2 and a minority of BRCA1 cancer patients, develop luminal tumors. The necessity of predicting HRD is therefore not limited to TNBC, but extends also to luminal BC. On the other hand, luminal BCs represent a far more frequent group than TNBC. For this reason, systematic screening of HR gene alterations for luminal cancers will be problematic and, in many countries, even infeasible due to both economic and logistic issues. Therefore, preselection of patients with a high probability of being HR deficient by analysis of WSI is a cost-efficient strategy that has so far only been hampered by the lack of knowledge about HRD specific morphological patterns in luminals. Indeed, only high grades and to a lower extent pushing margins have previously been reported to be associated with HRD. In this context, the identification of HRD from WSI by deep learning and the identification of related morphological patterns could both facilitate the preselection of breast cancers for molecular determination of HRD, which is particularly important for luminal cancers.
The TCGA provides a precious data set to train models for the prediction of genetic signatures from H&E data 5.8. While we obtained promising results for the prediction of HRD on the TCGA dataset in line with previous reports, we found that this result was partly due to the fact that the molecular subtype acts as a biological confounder. This was particularly problematic as we wanted to investigate the morphological signature of HRD. Of note, the existence of biological and technical confounders is presumably not limited to HRD prediction, but may concern many genetic signatures. The use of carefully curated data sets where technical and biological confounders can be controlled for, is thus an important step in investigating the predictability of genetic signatures, as well as the identification of their morphological counterparts.
In most cases, such in-house datasets also contain technical and biological biases, due to the long period during which the dataset is acquired. This motivated us to propose a method to mitigate bias in Computational Pathology workflows, based on strategic sampling. Such strategies are already used in other fields of biomedical imaging but have so far, to the best of our knowledge, not been used in Computational Pathology. We have shown that this approach can successfully mitigate or even eliminate bias. However, the method is limited by the number of variables we can correct for, as well as by the class imbalance it can handle. In some cases, stratification might therefore be preferable. In any case, it is important to be aware of the confounding variables in the data set, whose presence can lead to false expectations and misinterpretation. For this reason, we expect proper treatment of such variables to become a standard in the field.
While bias correction on the TCGA led to a drop in AUC to 0.63, we found that HRD was predictable in our in-house data set of 715 BC patients with an AUC of 0.83. While homogeneous datasets do not reflect the variability between centres and thus limit direct applicability of the trained networks, they allow for controlled feasibility studies, which now need to be complemented by multicentric studies. In addition, we will validate this algorithm in a prospective neoadjuvant clinical trial for which patients' HRD status will be assessed with MyChoice® CDx test (Myriad).
Homogeneous datasets are well suited for the identification of phenotypic patterns linked to disease patterns, even in cases where no such patterns are known a priori, as is the case for HRD. In order to identify a phenotypic signature related to an output variable (here HRD), we can either use biologically meaningful encodings, also known as human interpretable features (HIF), and infer the most relevant features by analyzing the weights in the predictive model 8, or we can turn to network introspection. The HIF approach relies on detailed and exhaustive annotations of a large number of WSI. For instance, the authors of reference 8 leverage annotations provided by hundreds of pathologists consisting of hundreds of thousands of manual cell and tissue classifications. Here, we provide a new network introspection scheme, relying on the powerful MoCo encodings, trained without supervision directly on histopathology data, and a decision-based tile selection, that allows us to automatically cluster tiles and to relate these clusters to the output variable.
Interestingly, while our approach confirms the recently published finding that necrosis is a hallmark of HRD8 and identifies morphological features common to HRD in TNBC and luminal BC, such as necrosis, high density in TILs and high nuclear anisokaryosis39, it also points to more specific patterns that have so far been overlooked. For instance, we found tiles enriched in carcinomatous cells with clear cytoplasm suggesting activation of specific metabolic processes in these cells. Second, we find intra-tumoral laminated fibrosis as an HRD related pattern. This suggests the hypothesis that cancer-associated fibroblast (CAF) within the stroma of HRD luminal tumors may play a role in the viability and fate of tumor cells. Furthermore, the presence of adipose tissue within the tumor suggests first a different tumor cell density and second a specific balance between CAF and adipocytes in the context of a luminal HRD tumor. The molecular mechanisms achieving these patterns remain to be determined by in vitro models.
Similar to what we have shown here with respect to HRD, the visualization framework we have developed is versatile and can in principle be applied in the context of other genetic signatures. Because the algorithm is fully automated, using the MIL algorithm and its visualization method can constitute a useful tool for the discovery of morphological features related to the predicted genetic signatures. This has the potential to generate new biological hypotheses on the phenotypic impact of these genetic disorders. In order to maximize the benefit for the scientific community, we release the code to train MIL models on WSIs and to create morphological maps as well as tile trajectories publicly and free of charge, and provide detailed documentation.
Altogether, this study provides new and versatile tools for the prediction and phenotypic dissection of genetic signatures from histopathology data. Application to luminal breast cancers allowed us to shed light on the phenotypic consequences of homologous recombination deficiency, and might provide a tool with the potential to impact breast cancer patient care.
In-house dataset (Institut Curie). We retrospectively retrieved a series of 715 patients with HE slides of surgical resection specimens of untreated breast cancer and a genomically known HR status. The series is composed of 309 Homologous Recombination Proficient tumors (HRP) and 406 Homologous Recombination Deficient tumors (HRD). The HRD status was either identified by the presence of a germline BRCA1/2 (gBRCA1/2) mutation or assessed by LST genomic signature according to Popova et al. for the sporadic triple-negative and luminal cancers.
All patients have been treated and followed at the Institut Curie between 1995 and 2020. The patients agreed to the use of tumor samples from their surgical resection specimens for research according to the law. Ethical approval from the Institutional Review Board (Institut Curie breast cancer study group N°DATA190031) was obtained for the use of all specimens. Clinical data have been retrieved from the Institut Curie electronic medical records and saved using Research electronic data capture (REDCap) tools hosted at the Institut Curie.
Public dataset (TCGA). This public dataset is composed of 815 WSI of breast cancer fixed in formalin (FFPE) and stained in H&E. They are available at https://portal.gdc.cancer.gov/. Low-resolution WSI, WSI containing artifacts such as pen marks, tissue-folds and blurred WSI were removed. The final dataset encompasses 691 WSIs. The HR status of the corresponding tumors was obtained using the LST genomic signature14.
Architecture and optimization parameters. Hyperparameters have been set thanks to a random search evaluated through 5-fold nested cross-validation. The benchmark task is the prediction of the molecular class of the TCGA WSIs.
Both the decision module and the tile-scoring module are multilayer perceptrons with batch normalization43 after each hidden layer. The decision module has 3 hidden layers of 512 neurons, the tile-scoring module has 1 hidden layer of 256 neurons. Dropout has been fixed at 0.4, the optimizer is ADAM44 with a learning rate of 3e-3. A batch consists of 16 samples of WSI. A sample of WSI corresponds to a uniform sampling of 300 of its composing tiles. In fact, we observed that this uniform subsampling of the WSIs regularizes training and reduces its computational workload. Finally, training is performed for 200 epochs. Training and performance evaluation are done in a 5-fold nested cross-validation framework. Each dataset is split into 5 independent folds. For each of these test folds, a validation set is randomly sampled from the complementary ⅘ of the dataset. A model is trained on the remaining data (⅘ × ⅘ of the total dataset). This process is repeated 10 times for each test fold, then the best model is selected according to its validation performance, and finally tested on its test set. Each test and validation set preserves the stratification of the whole dataset with respect to the target variable as well as the confounding variables in case we correct for them. The final performance estimate of the model is the average of the 5 test performances. At inference time, all the tiles of each WSI are processed.
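The splitting scheme described above can be illustrated with a minimal sketch. The function name `nested_cv_splits`, the seed, and the use of plain index lists are assumptions made for illustration; stratification on the target and confounding variables, which the text requires, is deliberately left out to keep the example short.

```python
import random

def nested_cv_splits(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (test_fold, repeat, train_idx, val_idx, test_idx) index splits.

    Hypothetical helper: unstratified, for illustration only.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Outer loop: n_folds disjoint test folds.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k, test_idx in enumerate(folds):
        rest = [i for f, fold in enumerate(folds) if f != k for i in fold]
        for r in range(n_repeats):
            # Inner loop: a fresh random validation set, one fifth of the
            # complementary data; the rest (about 4/5 * 4/5) is for training.
            shuffled = rest[:]
            rng.shuffle(shuffled)
            n_val = len(shuffled) // n_folds
            val_idx, train_idx = shuffled[:n_val], shuffled[n_val:]
            yield k, r, train_idx, val_idx, test_idx

splits = list(nested_cv_splits(n_samples=715))
```

With 715 slides this yields 5 × 10 = 50 candidate trainings; for each test fold, the model with the best validation performance among its 10 repeats would be the one evaluated on that fold.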
Strategic sampling is used both to balance the training dataset with respect to the output variable ($T=\{t_1,t_2\}$ in the binary case) and to correct for biases ($B=\{b_1, b_2, \ldots, b_n\}$). If $X$ is a given WSI sampled from the dataset, then $T(X)$ and $B(X)$ are respectively the target value and the bias value of $X$. We denote by $|t_j|$ the total number of slides in the dataset labeled with $t_j$, and likewise for $|b_i|$; $|t_j b_i|$ is the total number of slides with label value $t_j$ and bias value $b_i$. To achieve both, we sample the WSIs $X$ in each batch from a distribution $P$ under which $P(\{T(X)=t_1\})=P(\{T(X)=t_2\})$ and $P(\{T(X)=t_1\}\cap\{B(X)=b_i\})=P(\{T(X)=t_2\}\cap\{B(X)=b_i\})$ for all $i$.
That is: $P(X \mid \{T(X)=t_j\}\cap\{B(X)=b_i\}) \propto |t_j|/|t_j b_i|$ for each $i\le n$, $j\le m$.
Strategic sampling is performed on the fly when building the batches. When correcting for several confounders simultaneously, $B=\{b_1, b_2, \ldots, b_n\}$ and $C=\{c_1, c_2, \ldots, c_m\}$, a new confounder variable is created that takes values in all pairwise combinations of $b_i$ and $c_j$: $C'=\bigcup_{i=1}^{n}\bigcup_{j=1}^{m}\{(b_i, c_j)\}$. We then apply strategic sampling to correct for $C'$.
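As a hedged illustration of on-the-fly strategic sampling, the sketch below assigns each slide a weight inversely proportional to the size of its (target, bias) cell, which is one weighting that gives every cell the same total probability mass and therefore satisfies both equalities stated above. The function names, the toy cohort and the exact normalisation are assumptions, not the applicants' implementation.

```python
import random
from collections import Counter

def strategic_weights(targets, biases):
    """Per-slide sampling weights giving every (target, bias) cell equal total mass."""
    cell_counts = Counter(zip(targets, biases))
    raw = [1.0 / cell_counts[(t, b)] for t, b in zip(targets, biases)]
    total = sum(raw)
    return [w / total for w in raw]

def sample_batch(slide_ids, weights, batch_size=16, rng=None):
    """Draw one batch of slide ids with replacement, at batch-building time."""
    rng = rng or random.Random()
    return rng.choices(slide_ids, weights=weights, k=batch_size)

# Toy, made-up cohort: two targets and one confounder with unbalanced cells.
targets = ["HRD"] * 6 + ["HRP"] * 4
biases = ["b1", "b1", "b1", "b1", "b2", "b2", "b1", "b2", "b2", "b2"]
weights = strategic_weights(targets, biases)
batch = sample_batch(list(range(10)), weights, batch_size=16, rng=random.Random(0))
```

To correct for two confounders at once, the combined variable can be passed as pairs, for example `strategic_weights(targets, list(zip(b_values, c_values)))`, where `b_values` and `c_values` are hypothetical per-slide lists.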
The bias score of a confounder variable is the average mutual information between $B$ and the predicted class $C$, $I(B_D, C_D)$, estimated in a dataset $D$ in which the mutual information between $B$ and the target $T$ is zero. We sample 30 sub-datasets $D_i$ in the test set using strategic sampling such that I(BD
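The mutual information $I(B,Q)$ between two finite random variables $B$ and $Q$ taking values respectively in $\{b_1, \ldots, b_n\}$ and $\{q_1, \ldots, q_m\}$ measures the dependency between $B$ and $Q$ and is defined by: $I(B,Q)=\sum_{i=1}^{n}\sum_{j=1}^{m} P(B=b_i, Q=q_j)\,\log\frac{P(B=b_i, Q=q_j)}{P(B=b_i)\,P(Q=q_j)}$.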
Because this average is non-negative even when C is not biased (but tends to zero when Di is large), we compute the bias score of an unbiased model predicting the target variable with the same accuracy as the tested model. It serves as a control.
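A minimal sketch of how such a bias score could be estimated is given below. It uses the standard plug-in (empirical frequency) estimate of mutual information in nats; the function names and the representation of each sub-dataset as a pair of lists are assumptions made for illustration.

```python
import math
from collections import Counter

def mutual_information(b_values, q_values):
    """Plug-in estimate of I(B, Q) in nats from paired observations."""
    n = len(b_values)
    joint = Counter(zip(b_values, q_values))
    count_b = Counter(b_values)
    count_q = Counter(q_values)
    mi = 0.0
    for (b, q), count in joint.items():
        # p(b,q) * log(p(b,q) / (p(b) p(q))), with counts turned into frequencies
        mi += (count / n) * math.log(count * n / (count_b[b] * count_q[q]))
    return mi

def bias_score(sub_datasets):
    """Average I(B, C) over sub-datasets, each a (bias_values, predicted_classes) pair."""
    return sum(mutual_information(b, c) for b, c in sub_datasets) / len(sub_datasets)
```

Because the plug-in estimate is non-negative on finite samples, the same `bias_score` would also be computed for the unbiased control model, and the two values compared.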
For learning MoCo-v2 representations we used the MoCo repository available at https://github.com/facebookresearch/moco. We randomly applied the following transformations: Gaussian blur, crop and resize, color jitter, grayscale, horizontal and vertical symmetries, and finally a color augmentation in the Hematoxylin and Eosin specific space (ref Ruifrok). The training dataset is composed of 5.3e6 images of size 224×224 pixels, or half the Curie dataset at magnification 10×. We used a ResNet18 and trained it for 60 epochs on 4 Nvidia Tesla V100 SXM2 32 GB GPUs. We used the SGD optimizer with a momentum of 0.9, a weight decay of 1e-4 and a learning rate of 3e-3. We used a cosine scheduler with warm restarts on the learning rate.
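For readers unfamiliar with the learning-rate schedule mentioned above, the following sketch shows the usual cosine annealing rule with warm restarts. The restart period of 20 epochs and the minimum learning rate of 0 are hypothetical values chosen for illustration; the text only specifies the base learning rate (3e-3) and the 60 training epochs.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr=3e-3, min_lr=0.0, period=20):
    """Learning rate at `epoch` for a cosine schedule restarted every `period` epochs.

    `period` and `min_lr` are assumed values, not taken from the source.
    """
    t_cur = epoch % period  # epochs elapsed since the last restart
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t_cur / period))

schedule = [cosine_warm_restart_lr(e) for e in range(60)]
```

Within each period the rate decays smoothly from `base_lr` towards `min_lr`, then jumps back to `base_lr` at the restart.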
The model used to extract the visualizations has been trained on the luminal subset of the Curie dataset (259 WSI). To benefit from the biggest dataset possible, the model has been trained on the whole dataset, without early stopping or testing, for 200 epochs. To generate the attention-based visualization, the highest ranked tile with respect to the attention score is extracted for each WSI. The selected tiles are then labeled according to the label of their WSI of origin.
Concerning the decision-based visualisation, for each WSI the 300 highest ranked tiles with respect to the attention score are selected. Among this pool of tiles, the 2000 highest ranking tiles with respect to the posterior probability for HRD and HRP are selected. In order to promote diversity in the extracted images, no more than 20 tiles per slide can be selected.
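The two-stage selection described above can be sketched as follows. The tile representation (dictionaries with hypothetical keys `slide`, `attention` and `p_hrd`) and the reading that 2000 tiles are kept separately for each class are assumptions; the posterior for HRP is taken as one minus the posterior for HRD, which holds only in the binary case.

```python
from collections import defaultdict

def select_tiles(tiles, top_attention=300, top_decision=2000, max_per_slide=20):
    """tiles: iterable of dicts with keys 'slide', 'attention', 'p_hrd'."""
    # Step 1: per slide, keep the tiles with the highest attention scores.
    by_slide = defaultdict(list)
    for tile in tiles:
        by_slide[tile["slide"]].append(tile)
    pool = []
    for slide_tiles in by_slide.values():
        slide_tiles.sort(key=lambda t: t["attention"], reverse=True)
        pool.extend(slide_tiles[:top_attention])

    # Step 2: per class, keep the highest-posterior tiles, capped per slide.
    def top_for(score):
        chosen, per_slide = [], defaultdict(int)
        for tile in sorted(pool, key=score, reverse=True):
            if per_slide[tile["slide"]] < max_per_slide:
                chosen.append(tile)
                per_slide[tile["slide"]] += 1
            if len(chosen) == top_decision:
                break
        return chosen

    return {
        "HRD": top_for(lambda t: t["p_hrd"]),
        "HRP": top_for(lambda t: 1.0 - t["p_hrd"]),
    }
```

The per-slide cap is applied while walking down the ranked list, so a slide with many high-scoring tiles cannot dominate the gallery.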
All computations have been done on the GENCI HPC cluster of Jean-Zay.
For each of 715 patients with known HR status, the most representative HE-stained tissue section of the surgical resection specimen of breast cancer has been scanned. The series was composed of 309 Homologous Recombination Proficient (HRP) tumors and 406 Homologous Recombination Deficient (HRD) tumors.
Due to their enormous size, analysis of WSI typically relies on the Multiple Instance Learning (MIL) paradigm23-26. MIL techniques only require slide-level annotations and share the overall architecture (see
The WSI was divided into tile images (dimension: 224×224 pixels) arranged in a grid. Background tiles are removed, tissue tiles are encoded into a feature vector. Instead of using representations trained on natural image databases and unlike most studies in this domain, the self-supervised technique Momentum Contrast (MoCo27; see Methods) has been used. This method consists in training a Neural Network to recognize images after transformations, such as geometric transformations, noise addition and color changes. By choosing the kind and strength of transformations, invariance classes can be imposed, i.e. variations in the input that do not result in different representations. The feature vector of each tile was then mapped to a score by a neural network. The slide representation was obtained by the sum of the individual tile representations, weighted by the learned attention scores23. Finally, the slide representation was classified by the decision module (see
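The pooling step described above, in which tile feature vectors are summed with learned attention weights, can be sketched in a few lines. Normalising the tile scores with a softmax is an assumption commonly made in attention-based MIL and is not stated in the text; the function name and the plain-list representation are likewise illustrative.

```python
import math

def attention_pool(tile_features, tile_scores):
    """Weighted sum of tile feature vectors; weights are a softmax of the tile scores."""
    m = max(tile_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in tile_scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(tile_features[0])
    slide = [sum(w * f[d] for w, f in zip(weights, tile_features)) for d in range(dim)]
    return slide, weights
```

In the full model the returned slide vector would be passed to the decision module, and the weights are the attention scores later used for visualization.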
The method of example 1 has been applied to predict HRD from the WSI in the TCGA cohort, and results were obtained (AUC=0.71; see
Training a Neural Network (NN) to predict HRD on the raw data set, we observed an important increase in prediction performances, as compared to the TCGA (AUC=0.88; see
We then devised a sampling strategy that mitigates bias during training. Bias mitigation is an increasingly important line of research in machine learning. For instance, it is a well-known problem in training predictive models for functional magnetic resonance imaging (fMRI) data, where the age of the patient has been shown to be an important confounder29. While several techniques for bias mitigation exist30-33, a recent comparison34 indicates that strategic sampling is the method of choice if the distribution is not too imbalanced. Strategic sampling aims at ensuring that, irrespective of the composition of the training set, each batch presented to the neural network is composed of roughly the same number of samples for each value combination of output and confounding variable. Correcting for C1 and C2 resulted in a 4-fold reduction of the bias-score in comparison to the uncorrected model and a slightly lower accuracy (AUC=0.85,
In addition to these technical confounders, the molecular subtype of the tumor has been identified as a potential biological confounder. Successful correction of this biological confounder in the TCGA (see
In order to understand which phenotypic patterns are related to HRD on the WSI, we turned to visualization techniques for NN. The MIL framework used here is equipped with an inherent visualization mechanism: the second module of the algorithm, the tile-scoring module, is in fact an attention module that assigns to each tile an attention score that determines how much a given tile will contribute to the slide representation (and thus to the decision). Attention scores are often used for visualization in the field of pathology 3.35-37, either in the form of heatmaps in order to localize the origin of the relevant signals or in the form of galleries of tiles of interest (tiles with highest attention scores).
However, attention scores do not per se extract the tiles that are related to a certain output variable; they just reflect that the tile is to be taken into consideration in the decision. In particular in the case of genetic signatures, where we would expect that the output variables can be related to several morphological patterns, analysing only the attention scores might be limited. In the case of HRD prediction, this intuition is corroborated by the results presented in
Two expert pathologists labeled these clusters. The HRD signal relied on several clusters: HRD tumors present a high tumor cell density, with a high nucleus/cytoplasm ratio and conspicuous nucleoli. They also show regions of hemorrhagic suffusion associated with necrotic tissue. In the stroma, the HRD signal revealed the presence of striking laminated fibrosis and as expected relied on high Tumor-Infiltrating Lymphocytes (TILs) content. Lastly, one large cluster contained a continuum of several phenotypes, namely adipose tissue intermingled with scattered and clear tumor cells, histiocytes, and plasma cells.
In contrast, the HRP signal was mostly carried by one cluster characterized by low tumor cell density, the cells being moderately atypical and tumor cell nests separated from the stroma by clear spaces. Notably, it included a few invasive lobular carcinomas.
Our approach thus suggests that these tissue phenotypes are hallmarks of HRD. In contrast to triple-negative tumors that have been described as high-grade, rich in TILs, with pushing margins39, no specific pathological patterns of luminal breast gBRCA1 or gBRCA2 cancers were identified except a high grade with a frequent absence of tubule formation and pushing margins. However, these features did not allow a robust identification of HRD luminal tumors in clinical practice40-42. Here, thanks to our visualization tool and our unbiased deep-learning analysis, we identified new features linked to HRD in luminal breast cancers. In order to test this analysis result, the TIL density and nuclear grade were evaluated for each tumor of the in-house dataset by an expert pathologist. As predicted by our algorithm, TILs and nuclear grade were positively associated with the HR status of the tumor in the luminal subset (mean TILs HRD: 29, mean TILs HRP: 17, t-test p-value: 0.017; mean nuclear grade HRD: 2.7, mean nuclear grade HRP: 2.3, chi-squared test p-value: 1.2e-6). Our NN works with different internal representations. While the tile representations provided by MoCo permit the emergence of phenotypic similarity clusters (
Number | Date | Country | Kind |
---|---|---|---|
21306055.1 | Jul 2021 | EP | regional |
21306056.9 | Jul 2021 | EP | regional |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2022/071130 | 7/27/2022 | WO |