The present invention relates to a system and method for characterizing intercellular communication and heterogeneity in tumors, and more particularly a method for detecting sub-populations and receptor-ligand states for providing predictive information in relation to cancer and cancer treatment.
There is increasing awareness that tumors may be highly heterogeneous. Intratumoral heterogeneity has impeded the design, development and effective use of targeted therapies in the clinical setting. The origins of heterogeneity may be ascribed to differences in the genetic composition of the cells that constitute the tumor. Different subclones often cooperate in driving tumor growth and invasion. Thus, even directed therapies that target one or a small subset of subclones may not be effective, resulting in recurrence and metastatic disease. Differences in the genetic makeup and/or molecular composition manifest as distinct cell physiologies, some of which may be observable through imaging modalities. However, imaging captures only a bird's-eye view, and higher resolution at the molecular level may be essential in developing novel therapeutic strategies. Moreover, when a sample is analyzed with mere population (bulk) methods, very important molecules may not be detectable, and hence effective therapy planning is quite challenging. Single-cell data provide the resolution needed to detect these important signaling molecules in the cells, and in the cell populations those cells define, in which they are expressed.
Further, cell-cell communication is an essential component in biological processes including development, lineage determination, cell differentiation and signaling. Cell communication may happen within short ranges (paracrine) or long ranges (endocrine) and in some cases, cells may signal to themselves (autocrine). The normal molecular signaling patterns by which the many different cell types maintain coordinated function have been studied and described at a tissue level and captured using various methods such as immunohistochemistry and other molecular assays. However, there are no methods describing how to analyze and detect signaling among cell populations in the context of diagnosis and in the context of finding therapy targets that might disrupt intratumoral signaling at the population level. The current thinking is that if a receptor is present, it may be ubiquitously expressed in a majority of cancer cells and represents a hallmark for a certain subtype of cancer.
In cancers, accumulation of genetic aberrations in a subset of cells results in alterations to the “normal” roles these cells typically play in their native tissue of origin or organ. Some of these mutations may confer a proliferative growth advantage to these cells and facilitate cancer progression. As the landscape of these mutations expands, the genetic heterogeneity within the tumor increases and so does the diversity of molecular signals associated with the tumorous tissue and its environment. Such heterogeneous tumors, which are composed of multiple sub-populations, pose a major challenge to therapy. This is further exacerbated under selection pressures such as chemotherapy, resulting in the emergence of subclones that are resistant and may take over the tumor mass. This heterogeneity is difficult to characterize using mere population data collection strategies. Single-cell technologies can be used to overcome some of these challenges and to examine the extent and nature of heterogeneity.
Over the past few decades, rapid advances in next-generation sequencing technologies have had tremendous impact on both the volume and quality of genomic and molecular data. Alterations to the genome, small and big, have been implicated in several cancers. These genetic changes often result in changes at the transcript level and comprehensive estimation of transcript abundances can be obtained using RNA sequencing (“RNA-seq”) using Next Generation Sequencing (“NGS”) hardware (produced for example from Illumina, Life Technologies, PacBio, NanoString, etc.). These high-throughput technologies, both genome sequencing and RNA-seq, require a substantial number of cells and provide mere population averages at the overall “bulk” tumor sample (i.e., all the cells are sequenced together). Thus, these data do not reveal inherent stochasticity or systematic variations within the population. Most existing methodologies describing signaling and pathway enrichments typically utilize population data, which preclude comprehensive assessments on intrinsic heterogeneities in the tumor. More recently, the advent of single-cell sequencing has enabled information gathering at the cellular level. Specifically, single-cell RNA-seq technologies allow for sampling transcript abundance in individual cells, providing a glimpse into the underlying molecular heterogeneity and possibly a glimpse into how molecular signaling and communication is altered in tumors.
Unlike population methods, single-cell technologies offer a huge advantage since they enable information gathering at the next level of resolution while retaining cellular heterogeneity information. In heterogeneous tissues such as tumors, the transcript abundances of signaling molecules display considerable variability. Using mere population (bulk) methods, these critical molecules may not be detectable, and hence effective therapy planning is quite challenging. Single-cell data, on the other hand, provide the resolution to detect these molecules in the cells in which they are expressed. Thus, signaling and pathway analyses derived from these data provide a unique perspective compared to those obtained from population data. Moreover, perhaps due to the heterogeneity, signaling mechanisms and pathways enriched in tumor cells are expected to be very different from normal molecular signaling patterns observed in different cell types that are required to maintain coordinated function. The observation that different subclones in a tumor may act in synchrony for proliferation and progression of the disease leads to the hypothesis that there is active communication between them. Such communication is mediated through a network of ligands and receptors, and these interactions initiate the downstream signaling. The ligand, which is the message, is communicable to all the cells in the ecosystem that have a suitable receiver, the receptor. In this novel paradigm, the same cell need not synthesize the ligand and its cognate receptor for signaling to be initiated. Depending on the nature of the ligand and its stability, long-range and/or short-range communications between cells in different subclones can be established. In this invention, by leveraging the advantages of single-cell RNA-seq data, a system and a method are disclosed for identifying curated pairs of ligands and their cognate receptors that display checkered expression patterns within a population of cells from a tumor.
Although single-cell technologies can be used to overcome some of the challenges in examining the extent and nature of heterogeneity, there is still a need for methods to detect and analyze signaling among cell populations in the contexts of precision diagnostics, identifying therapy targets, and precision oncology, which includes individualized therapy planning for patients. Within this context, there is a need for methods for identifying sub-populations where intratumoral signaling occurs as well as identifying curated receptor-ligand pairs that may be implicated in the signaling. Hence, a framework that provides insight into intercellular communication is needed. Accordingly, a method that leverages the advantages of single-cell RNA-seq in a quantization procedure for detecting receptor-ligand states as a measure of heterogeneity, by transforming the RNA-seq data into a receptor-ligand communication map (ReLiCoMap) that provides insight into intercellular communications, would be advantageous. In particular, a system for identifying cellular sub-populations and the likely receptor-ligand interactions that govern intercellular communication between the sub-populations by transforming the RNA-seq data into a receptor-ligand communication map (ReLiCoMap) would be advantageous.
In particular, it may be seen as an object of the present invention to provide a framework that solves the above-mentioned problems of the prior art by providing a system that feeds single-cell RNA-seq data into a quantization procedure for receptor-ligand states as a measure of heterogeneity. It is also an object of the present invention to provide a system that provides a visualized receptor-ligand display, such as a receptor-ligand map, or ReLiCoMap (as used herein), for identifying cellular sub-populations and the likely receptor-ligand interactions that govern intercellular communication between the sub-populations. It is a further object of the present invention to provide an alternative to the prior art.
Thus, the above-described object and several other objects are intended to be obtained in a first aspect of the invention by a system that transforms normalized single cell RNA-seq data into a ReLiCoMap through quantization of receptor-ligand states, to identify components of intercellular communication, such system comprising:
a sequencer for providing RNA-seq data from a plurality of individual cells of a cancer or tumor;
an interface in communication with the sequencer for receiving the RNA-seq data from the sequencer, said interface configured to perform the steps of;
an output device, in communication with the storage database through an interface, wherein said output device is configured to obtain the identified sub-cluster and receptor-ligand data from the storage device, and to provide a display in the form of a heatmap or ReLiCoMap from which a clinician can identify receptor-ligand pairs that may be involved in cellular communication within the cancer or tumor.
In addition, the above-described object and several other objects are intended to be obtained in a first aspect of the invention by providing a method for utilizing normalized single cell RNA-seq data, to quantize receptor-ligand states to identify components of intercellular communication, such method comprising the steps of:
obtaining a plurality of normalized single-cell data sets generated by RNA-seq. of a plurality of single cells from a tumor;
obtaining a plurality of data sets from a curated gene list of receptor-ligand pairs;
normalizing the data from the curated gene list;
selecting genes by extracting their normalized transcript abundance data;
assigning a state (e.g., one of four states {0,1,2,3}) to each curated receptor-ligand pair in each cell depending on binarized levels of ligand and receptor (e.g., {R:L}={0:0, 0:1, 1:0, 1:1}), thereby forming a matrix of receptor-ligand states;
extracting sub-groups from the matrix that are invariant; and
applying unsupervised clustering methods and identifying sub-clusters in the data.
The various steps of the invention may in certain instances be interchanged or combined as is understandable from the principles of the invention.
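The binarization and state-assignment steps above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function names, the placeholder gene names `LIG_A` and `REC_A`, and the numbering of states as 2·R + L (which follows the order in which the {R:L} combinations are listed) are assumptions made for the example, and the threshold is left as a parameter.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (ligand gene, receptor gene); layout assumed for this sketch


def binarize(abundance: float, threshold: float) -> int:
    """Return 1 when the normalized abundance exceeds the threshold, else 0."""
    return 1 if abundance > threshold else 0


def state_matrix(
    expr: Dict[str, List[float]],
    pairs: List[Pair],
    threshold: float,
) -> Dict[Pair, List[int]]:
    """Build a pair-by-cell matrix of receptor-ligand states.

    expr maps a gene name to its per-cell normalized abundances (same cell
    order for every gene). Assumed encoding: state = 2*R + L, so that
    {R:L} = 0:0, 0:1, 1:0, 1:1 map to states 0, 1, 2, 3.
    """
    matrix: Dict[Pair, List[int]] = {}
    for ligand, receptor in pairs:
        if ligand not in expr or receptor not in expr:
            continue  # skip pairs with a gene missing from the data set
        row = []
        for l_val, r_val in zip(expr[ligand], expr[receptor]):
            r_bit = binarize(r_val, threshold)
            l_bit = binarize(l_val, threshold)
            row.append(2 * r_bit + l_bit)
        matrix[(ligand, receptor)] = row
    return matrix


# Toy data: four cells, one hypothetical ligand-receptor pair.
toy_expr = {
    "LIG_A": [20.0, 0.0, 30.0, 5.0],
    "REC_A": [0.0, 18.0, 40.0, 1.0],
}
toy_states = state_matrix(toy_expr, [("LIG_A", "REC_A")], threshold=16.0)
print(toy_states[("LIG_A", "REC_A")])  # [1, 2, 3, 0]
```

Each row of the resulting matrix corresponds to one receptor-ligand pair and each column to one cell, which is the form on which the sub-group extraction and clustering steps operate.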
In an advantageous embodiment, the invention may be utilized for therapy planning and diagnostics applications. For instance, in serially sampled and analyzed biopsy samples, our tool provides a way for physicians to evaluate the efficacy of current treatment and to design and plan a therapy strategy. It also allows physicians to make suitable changes to the treatment strategy informed by results from our tool. Additionally, our tools aid researchers and practitioners in hypothesis generation, to screen candidate genes (ligands and receptors) that can be targeted by existing drugs and to test response to these drugs in cell cultures and xenograft models.
In the context of the present invention, the term “curated gene list” is taken to mean the gene list reported by J. A. Ramilowski et al., “A draft network of ligand-receptor-mediated multicellular signalling in human,” Nat. Commun., vol. 6, p. 7866, 2015.
In the context of the present invention, the term “curated receptor-ligand pair” is taken to mean curated list of 1894 receptor-ligand pairs reported by J. A. Ramilowski et al., “A draft network of ligand-receptor-mediated multicellular signalling in human,” Nat. Commun., vol. 6, p. 7866, 2015.
In the context of the present invention, the term “invariant” is taken to mean values that show variation below quantile level of 0.75 of the overall standard deviation of the same values.
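One possible reading of this definition can be sketched as follows: compute a standard deviation for each row of the receptor-ligand state matrix, take the 0.75 quantile of those standard deviations as a cutoff, and label rows falling below the cutoff as invariant. This reading, the linear-interpolation quantile, and all names in the code are assumptions made for illustration.

```python
import statistics
from typing import Dict, List, Tuple


def quantile(values: List[float], q: float) -> float:
    """Quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def split_by_variation(
    matrix: Dict[str, List[int]], q: float = 0.75
) -> Tuple[List[str], List[str], float]:
    """Split rows into (invariant, variable) using the q-quantile of row SDs.

    Assumed interpretation: a row is invariant when its population standard
    deviation is strictly below the q-quantile of all row standard deviations.
    """
    sds = {key: statistics.pstdev(row) for key, row in matrix.items()}
    cutoff = quantile(list(sds.values()), q)
    invariant = [key for key, sd in sds.items() if sd < cutoff]
    variable = [key for key, sd in sds.items() if sd >= cutoff]
    return invariant, variable, cutoff


# Toy state matrix: rows are hypothetical receptor-ligand pairs, columns are cells.
toy_matrix = {
    "A": [0, 0, 0, 0],
    "B": [3, 3, 3, 3],
    "C": [0, 0, 0, 3],
    "D": [0, 3, 0, 3],
}
invariant_rows, variable_rows, sd_cutoff = split_by_variation(toy_matrix)
print(invariant_rows, variable_rows)  # ['A', 'B', 'C'] ['D']
```

Under this reading roughly three quarters of the rows are always labelled invariant, so the quantile level acts as a tunable stringency parameter.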
In the context of the present invention, the term “unsupervised clustering” or “unsupervised clustering method” is taken to mean clustering methods, including but not limited to hierarchical clustering, performed without any prior label information. See, e.g., James, G. et al., “An Introduction to Statistical Learning,” Springer, 1st ed. 2013, Corr. 5th printing 2015.
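As a hedged illustration of hierarchical clustering without labels, the sketch below groups cells by average-linkage agglomeration over their receptor-ligand state vectors. The distance places the mixed states 1 and 2 midway between states 0 and 3, in line with the equidistance convention described for the illustrated embodiment; treating states 1 and 2 as zero distance from each other, the choice of average linkage, and all names are assumptions of this example.

```python
from itertools import combinations
from typing import Dict, List

# Assumed positions: states 1 and 2 are equidistant from states 0 and 3.
POSITION = {0: 0.0, 1: 0.5, 2: 0.5, 3: 1.0}


def cell_distance(a: List[int], b: List[int]) -> float:
    """Sum of per-pair distances between two cells' state vectors."""
    return sum(abs(POSITION[x] - POSITION[y]) for x, y in zip(a, b))


def average_linkage(cells: Dict[str, List[int]], n_clusters: int) -> List[List[str]]:
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[cell_id] for cell_id in cells]
    while len(clusters) > n_clusters:
        best = None  # (average distance, index i, index j) of closest pair
        for i, j in combinations(range(len(clusters)), 2):
            dists = [
                cell_distance(cells[a], cells[b])
                for a in clusters[i]
                for b in clusters[j]
            ]
            avg = sum(dists) / len(dists)
            if best is None or avg < best[0]:
                best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


# Toy example: four cells described by three receptor-ligand pair states each.
toy_cells = {
    "c1": [0, 0, 3],
    "c2": [0, 1, 3],
    "c3": [3, 3, 0],
    "c4": [3, 2, 0],
}
toy_clusters = average_linkage(toy_cells, n_clusters=2)
print(toy_clusters)  # [['c1', 'c2'], ['c3', 'c4']]
```

This quadratic-time loop is meant only to make the procedure concrete; any established hierarchical clustering implementation could be substituted.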
In the context of the present invention, the term “sub-population” is taken to mean, as an example, a partitioning of cells identified to indicate a sub-structure in the data using the R package NbClust. This package provides different indices for determining the optimal number of clusters in a data set and offers the best clustering scheme from different results to the user. However, multiple algorithms for unsupervised learning can be used in the present invention instead of hierarchical clustering, such as k-means or PAM. In addition to NbClust, there are other methods that can be used for determining the number of clusters such as the elbow, the silhouette and gap statistic method.
In the context of the present invention, the term “sub-clusters” is taken to mean high-confidence smaller clusters that form part of a larger cluster. Confidence is estimated using resampling methods such as, for example, multiscale bootstrap (R package pvclust).
In the context of the present invention, the term “checkered expression” is taken to mean values that are not constant or nearly constant across all observations.
In the context of the present invention, the term “normalized expression” is taken to mean transcript abundances that are scaled for total reads from the sequencing experiment and scaled for length of the transcript. This permits comparison of transcript abundances between genes and across experiments.
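The scaling described in this definition corresponds to the familiar FPKM measure (fragments per kilobase of transcript per million mapped fragments), the unit used for thresholding in the illustrated embodiments. A minimal sketch, with function and parameter names chosen for this example only:

```python
def fpkm(fragment_count: float, transcript_length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM = fragments * 1e9 / (transcript length in bp * total mapped fragments),
    i.e. the count is scaled by transcript length (per kilobase) and by
    sequencing depth (per million fragments).
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# 200 fragments on a 2,000 bp transcript in a library of 10 million fragments.
example_fpkm = fpkm(200, 2_000, 10_000_000)
print(example_fpkm)  # 10.0
```

Because both length and depth are divided out, values computed this way can be compared between genes and across sequencing runs, as the definition requires.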
In the context of the present invention, the term “ReLiCoMap” refers to a receptor-ligand communication visualization tool that displays the results obtained in a form that allows a physician or clinician to identify the receptor-ligand pairs likely to be involved in cellular communication and to devise appropriate therapies.
According to a further aspect of the invention, a method is provided for utilizing normalized single cell RNA-seq data in the estimation of sub-clusters in genes, to identify components of intercellular communication, such method comprising the steps of:
obtaining a plurality of normalized single-cell data sets generated by RNA-seq. of a plurality of single cells from a tumor;
obtaining a plurality of data sets from a curated list of receptor-ligand pairs;
normalizing the data from the curated gene list;
selecting genes into a gene set by extracting their normalized transcript abundance data, and removing genes that exhibit low variation in transcript abundance from the set;
identifying sub-populations within the set based on pair-wise distances between individual cells and similarity of cellular transcriptomes; and
identifying expressed ligands and receptors across the sub-populations and cross-referencing against the curated set of receptor-ligand pairs.
The methods according to the invention will now be described in more detail with regard to the accompanying figures. The figures show ways of implementing the present invention and are not to be construed as being limiting to other possible embodiments falling within the scope of the attached claims.
The present invention provides a system and methods for quantization of receptor-ligand states to identify components of intercellular communication and a method for sub-population detection and identification of likely receptor-ligands that orchestrate intercellular communication between sub-populations. The present invention is described in further detail below with reference made to
According to an embodiment of the present invention, a first process of quantization of receptor-ligand states to identify components of intercellular communication is set forth by the steps outlined in
As an illustration of this embodiment, we utilized single-cell RNA-seq data obtained from five glioblastoma multiforme (GBM) patients, published by A. P. Patel et al., “Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma,” Science, vol. 344, no. 6190, pp. 1396-1401, June 2014. We downloaded raw data, mapped the reads and quantified normalized transcript abundance in 430 cells from five patients. Each patient sample included 70 to 118 cells. Receptor-ligand pairs were selected from a curated list of 2557 receptor-ligand pairs reported by J. A. Ramilowski et al. For expression thresholding, a threshold of 16 FPKM units was selected to binarize the data (Step 1b). It should be noted that this threshold is arbitrary but is best chosen keeping in mind the constraints imposed by molecular biology on the detectability of transcript abundance.
The second step, Step 2 of
In this illustration of this embodiment, each curated receptor-ligand pair in each cell was classified into one of four states (0,1,2,3), depending on whether ligand and receptor expression was 0 or 1. For the purposes of clustering, receptor-ligand pairs quantized at states 1 or 2 (i.e., either of the ligand's or receptor's expression was 0 but not both) were considered to be equidistant from the other two states (i.e., 0 and 3). From this, a quantized-state matrix for receptor-ligand pairs in 430 cells is obtained. We use a heatmap along with hierarchical clustering to visualize this data for each patient. In the heatmap for patient 5 (see
The steps of a further embodiment of the within invention are outlined in
In Step 2 of this embodiment,
Step 3 of this embodiment entails clustering and extracting of sub-groups. This step comprises identification of expressed ligands and receptors across the sub-populations and cross-referencing them against the curated set of receptor-ligand pairs. A gene (ligand or receptor) is considered to be “ON” (i.e., expressed) in a sub-population of cells if its transcript abundance is more than a chosen threshold εth in a majority of cells comprising that partition (same threshold as described in the previous embodiment). Genes that are considered “ON” are given a value 1 and others, 0. From the curated set of receptor-ligand pairs, ligands and their cognate receptors that are likely communicating between the sub-populations are identified.
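The majority rule and the cross-referencing described in this step can be sketched as follows. The data layout, the function names, the placeholder gene names, and the restriction to pairs whose ligand and receptor are “ON” in different sub-populations are assumptions made for this illustration.

```python
from typing import Dict, List, Set, Tuple


def on_genes(
    expr: Dict[str, Dict[str, float]], members: List[str], threshold: float
) -> Set[str]:
    """Genes whose abundance exceeds the threshold in a strict majority of member cells."""
    on = set()
    for gene, per_cell in expr.items():
        n_above = sum(1 for cell in members if per_cell[cell] > threshold)
        if n_above > len(members) / 2:
            on.add(gene)
    return on


def communicating_pairs(
    expr: Dict[str, Dict[str, float]],
    subpops: Dict[str, List[str]],
    curated_pairs: List[Tuple[str, str]],
    threshold: float,
) -> List[Tuple[str, str, str, str]]:
    """Return (ligand, source sub-population, receptor, target sub-population) hits.

    Only curated (ligand, receptor) pairs are considered, and only when the
    ligand and receptor are ON in two different sub-populations (assumption).
    """
    on = {name: on_genes(expr, cells, threshold) for name, cells in subpops.items()}
    hits = []
    for ligand, receptor in curated_pairs:
        for source, source_on in on.items():
            if ligand not in source_on:
                continue
            for target, target_on in on.items():
                if target != source and receptor in target_on:
                    hits.append((ligand, source, receptor, target))
    return hits


# Toy data: two sub-populations of two cells each, one hypothetical curated pair.
toy_expr = {
    "LIG_A": {"c1": 20.0, "c2": 30.0, "c3": 0.0, "c4": 2.0},
    "REC_A": {"c1": 0.0, "c2": 1.0, "c3": 25.0, "c4": 40.0},
}
toy_subpops = {"P1": ["c1", "c2"], "P2": ["c3", "c4"]}
toy_hits = communicating_pairs(toy_expr, toy_subpops, [("LIG_A", "REC_A")], threshold=16.0)
print(toy_hits)  # [('LIG_A', 'P1', 'REC_A', 'P2')]
```

Restricting the search to the curated list is what trades sensitivity to novel pairs for a lower false-positive rate.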
Illustrating Step 3 in this embodiment, once the cells were partitioned into sub-populations, we used binarized expression data to identify ligands and receptors that were expressed in the different sub-populations. We considered a gene (ligand or receptor) to be “ON” if its transcript abundance was above a chosen threshold (16 FPKM units) in at least a majority of the cells that make up the sub-population. We matched each potential receptor-ligand pair that is “ON” across different sub-populations with the curated set of receptor-ligand pairs (Ramilowski et al.) to identify the putative receptor-ligand pair(s) that underlie intercellular communication between the sub-populations identified (subset shown in Table 2 for one iteration of the procedure). This cross-referencing helps reduce potential false positives but at the same time may limit the identification of novel, previously uncharacterized receptor-ligand pairs. It should be noted that some elements discovered (e.g., EGFR, MDK) are clearly common to both embodiments described here. We can now cross-reference against a list of drugs that target these molecules and present such a list to the physician, who can then decide on their suitability. For example, well-known EGFR inhibitors including lapatinib (typically used for breast cancer) and erlotinib (typically used for some lung cancers) are FDA-approved drugs for other cancers, not for GBM as in our illustration. However, providing this information, together with information on other relevant drugs that are in different stages of clinical trials, presents the physician with an important data point for potential treatment.
Table 2, below, shows a communication map of possible receptor-ligand pairs in an individual patient. This kind of a communication map could represent a communication signature for the tumor of an individual patient to be used both for diagnostic and therapy planning purposes.
We visualize this communication map in
In yet another aspect of the invention, single-cell RNA-seq data are obtained from the sequencing machine (or any other source) as fastq files (or another suitable format). The pipeline execution engine is run to map the sequencing data by aligning the short reads to a reference sequence such as a reference genome, or by stitching together a large number of these short read sequences to form a longer contiguous region. Auxiliary Bio databases supply the information needed for these steps, including but not limited to the reference genome sequence, gene models and their coordinates in the genome. The results from these processing steps are stored in the Results storage database and can be accessed by the user at any time. The Interactome Map Module provides methodologies, such as a ReLiCoMap, to discover potential interactions from the sequencing data. A reference database such as the Interactions database provides a cross-referenceable system that limits the scope of interactions to those that have been curated. The Interactome Map Execution Engine constructs the interactome map results, which are stored. These results are cross-referenced with the Clinical database module to identify any actionable clinical information that may be of benefit to a clinician. This information may include a list of FDA-approved drugs, or drugs in clinical trials, that are known to target components identified by the Interactome Map Execution Engine. The results are then visualized through a graphical user interface.
The within invention finds application in diagnostics and therapy planning approaches. It enables a physician to uncover potential signaling that occurs within these tumors and to devise combination treatment strategies to disrupt it. In situations where serial temporal sampling is possible, it gives physician scientists the ability to monitor the signaling patterns that emerge as a result of their treatment. This provides an opportunity to closely monitor treatment and allows for a course correction in the treatment regimen if necessary. Upon identifying important receptor-ligand interactions, we cross-reference each component against a database of drugs approved by the FDA for any disease. These options may be used to provide therapy planning choices.
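The final cross-referencing of identified components against a drug database might be sketched as below. The lookup table is a stand-in for a real clinical database and contains only the EGFR inhibitors named earlier in this description; the function name, the table layout and the placeholder ligand `LIGAND_X` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Stand-in for a clinical database lookup: target gene -> known drugs.
# Only the EGFR inhibitors mentioned in this description are listed.
DRUG_TARGETS: Dict[str, Set[str]] = {
    "EGFR": {"lapatinib", "erlotinib"},
}


def therapy_candidates(
    pairs: List[Tuple[str, str]], drug_targets: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Map each ligand or receptor found in the table to its sorted drug list."""
    candidates: Dict[str, List[str]] = {}
    for ligand, receptor in pairs:
        for gene in (ligand, receptor):
            if gene in drug_targets:
                candidates[gene] = sorted(drug_targets[gene])
    return candidates


# One identified pair with a hypothetical ligand and EGFR as the receptor.
toy_candidates = therapy_candidates([("LIGAND_X", "EGFR")], DRUG_TARGETS)
print(toy_candidates)  # {'EGFR': ['erlotinib', 'lapatinib']}
```

The output is a list for the physician to review and is not a treatment recommendation in itself.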
While there have been shown and described and pointed out fundamental novel features of the invention as applied to preferred embodiments thereof, it will be understood that various omissions and substitutions and changes in the form and details of the devices and methods described may be made by those skilled in the art without departing from the spirit of the invention. For example, it is expressly intended that all combinations of those elements and/or method steps which perform substantially the same function in substantially the same way to achieve the same results are within the scope of the invention. Moreover, it should be recognized that structures and/or elements and/or method steps shown and/or described in connection with any disclosed form or embodiment of the invention may be incorporated in any other disclosed or described or suggested form or embodiment as a general matter of design choice. It is the intention, therefore, to be limited only as indicated by the scope of the claims appended hereto.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2017/058322 | 4/7/2017 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62323288 | Apr 2016 | US |