In general, complex tumor tissue (or other diseased tissue) may comprise a population of tumor cells and a tumor microenvironment (TME) which may include, for example, immune cells, fibroblasts, and extracellular matrix proteins.
Some embodiments provide for a method for using machine learning to estimate tumor expression levels of genes in tumor cells in a biological sample of a subject having cancer, the biological sample comprising the tumor cells and tumor microenvironment (TME) cells, the method comprising: obtaining expression data for a set of genes, the set of genes comprising a first plurality of genes associated with the tumor cells and a second plurality of genes associated with the tumor microenvironment cells, the expression data comprising first total expression levels for genes in the first plurality of genes and second total expression levels for genes in the second plurality of genes; determining the tumor expression levels of the first plurality of genes in the tumor cells using a plurality of machine learning models, the plurality of machine learning models comprising a respective machine learning model for each gene in the first plurality of genes including a first machine learning model for a first gene in the first plurality of genes, the tumor expression levels including a first tumor expression level for the first gene in the tumor cells, the determining comprising: generating a first set of features for the first gene, the generating including: obtaining, using the expression data, an initial expression level estimate of the first gene in the tumor cells of the biological sample and including the initial expression level estimate of the first gene in the first set of features; including at least some of the first total expression levels in the first set of features; and including at least some of the second total expression levels in the first set of features; providing the first set of features as input to the first machine learning model to obtain an output indicative of a TME expression level estimate of the first gene in the TME cells; and determining the first tumor expression level for the first gene in the tumor cells using the output of the first machine 
learning model and a total expression level, in the first total expression levels, for the first gene; and outputting the tumor expression levels of the first plurality of genes in the tumor cells.
Some embodiments provide for a system, comprising: at least one processor; at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one processor, cause the at least one processor to perform a method for using machine learning to estimate tumor expression levels of genes in tumor cells in a biological sample of a subject having cancer, the biological sample comprising the tumor cells and tumor microenvironment (TME) cells, the method comprising: obtaining expression data for a set of genes, the set of genes comprising a first plurality of genes associated with the tumor cells and a second plurality of genes associated with the TME cells, the expression data comprising first total expression levels for genes in the first plurality of genes and second total expression levels for genes in the second plurality of genes; determining the tumor expression levels of the first plurality of genes in the tumor cells using a plurality of machine learning models, the plurality of machine learning models comprising a respective machine learning model for each gene in the first plurality of genes including a first machine learning model for a first gene in the first plurality of genes, the tumor expression levels including a first tumor expression level for the first gene in the tumor cells, the determining comprising: generating a first set of features for the first gene, the generating including: obtaining, using the expression data, an initial expression level estimate of the first gene in the tumor cells of the biological sample and including the initial expression level estimate of the first gene in the first set of features; including at least some of the first total expression levels in the first set of features; and including at least some of the second total expression levels in the first set of features; providing the first set of features as input to the first machine learning model to 
obtain an output indicative of a TME expression level estimate of the first gene in the TME cells; and determining the first tumor expression level for the first gene in the tumor cells using the output of the first machine learning model and a total expression level, in the first total expression levels, for the first gene; and outputting the tumor expression levels of the first plurality of genes in the tumor cells.
Some embodiments provide for at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for using machine learning to estimate tumor expression levels of genes in tumor cells in a biological sample of a subject having cancer, the biological sample comprising the tumor cells and tumor microenvironment (TME) cells, the method comprising: obtaining expression data for a set of genes, the set of genes comprising a first plurality of genes associated with the tumor cells and a second plurality of genes associated with the TME cells, the expression data comprising first total expression levels for genes in the first plurality of genes and second total expression levels for genes in the second plurality of genes; determining the tumor expression levels of the first plurality of genes in the tumor cells using a plurality of machine learning models, the plurality of machine learning models comprising a respective machine learning model for each gene in the first plurality of genes including a first machine learning model for a first gene in the first plurality of genes, the tumor expression levels including a first tumor expression level for the first gene in the tumor cells, the determining comprising: generating a first set of features for the first gene, the generating including: obtaining, using the expression data, an initial expression level estimate of the first gene in the tumor cells of the biological sample and including the initial expression level estimate of the first gene in the first set of features; including at least some of the first total expression levels in the first set of features; and including at least some of the second total expression levels in the first set of features; providing the first set of features as input to the first machine learning model to obtain an output 
indicative of a TME expression level estimate of the first gene in the TME cells; and determining the first tumor expression level for the first gene in the tumor cells using the output of the first machine learning model and a total expression level, in the first total expression levels, for the first gene; and outputting the tumor expression levels of the first plurality of genes in the tumor cells.
In some embodiments, the plurality of machine learning models includes a second machine learning model for a second gene in the first plurality of genes and the tumor expression levels include a second tumor expression level for the second gene in the tumor cells, wherein the second machine learning model is different from the first machine learning model and wherein the second gene is different from the first gene. In some embodiments, determining the tumor expression levels of the first plurality of genes in the tumor cells further comprises: generating a second set of features for the second gene; providing the second set of features as input to the second machine learning model to obtain an output indicative of a TME expression level estimate of the second gene in the TME cells; and determining the second tumor expression level for the second gene in the tumor cells using the output of the second machine learning model and a total expression level, in the first total expression levels, for the second gene.
In some embodiments, generating the second set of features for the second gene comprises: obtaining, using the expression data, an initial expression level estimate of the second gene in the tumor cells of the biological sample and including the initial expression level estimate of the second gene in the second set of features; including at least some of the first total expression levels in the second set of features; and including at least some of the second total expression levels in the second set of features.
In some embodiments, the plurality of machine learning models includes a third machine learning model for a third gene in the first plurality of genes and the tumor expression levels include a third tumor expression level for the third gene in the tumor cells, wherein the third machine learning model is different from the first machine learning model and from the second machine learning model, wherein the third gene is different from the second gene and from the first gene. In some embodiments, determining the tumor expression levels of the first plurality of genes in the tumor cells further comprises: generating a third set of features for the third gene; providing the third set of features as input to the third machine learning model to obtain an output comprising a TME expression level estimate of the third gene in the TME cells; and determining the third tumor expression level for the third gene in the tumor cells using the output of the third machine learning model and a total expression level, in the first total expression levels, for the third gene.
In some embodiments, generating the first set of features for the first gene further comprises: obtaining, using the expression data, a first plurality of RNA percentages for a respective plurality of types of cells that occur in the TME, wherein each of the first plurality of RNA percentages indicates a percent of RNA associated with the first gene and originating from cells of a respective type in the TME in the biological sample.
In some embodiments, generating the first set of features for the first gene further comprises including at least some of the first plurality of RNA percentages in the first set of features.
In some embodiments, obtaining the first plurality of RNA percentages comprises processing at least some of the expression data using at least one non-linear regression model.
In some embodiments, the TME cells comprise TME cells of a first type and TME cells of a second type. In some embodiments, the at least some of the expression data includes a first subset of the expression data and a second subset of the expression data. In some embodiments, the at least one non-linear regression model includes a first non-linear regression model and a second non-linear regression model different from the first non-linear regression model. In some embodiments, obtaining the first plurality of RNA percentages comprises: processing the first subset of the expression data using the first non-linear regression model to obtain a first RNA percentage for the TME cells of the first type; and processing the second subset of the expression data using the second non-linear regression model to obtain a second RNA percentage for the TME cells of the second type.
In some embodiments, the first type and the second type are each selected from the group consisting of B cells, CD4+ T cells, CD8+ T cells, endothelial cells, fibroblasts, lymphocytes, macrophages, monocytes, NK cells, and neutrophils, wherein the first type is different from the second type.
In some embodiments, obtaining the initial expression level estimate of the first gene in the tumor cells of the biological sample comprises: obtaining an average TME expression level of the first gene for each of the plurality of types of cells that occur in the TME; determining a weighted sum of the obtained expression levels based on the first plurality of RNA percentages; and subtracting the weighted sum from the total expression level for the first gene to obtain the initial expression level estimate.
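The subtraction described above can be sketched as follows. This is a minimal illustration only, assuming that RNA percentages are expressed as fractions between 0 and 1 and that all values share the same expression units; the function name, the dictionary layout, and the numbers are hypothetical and do not come from the embodiments.

```python
def initial_tumor_estimate(total_expression, avg_tme_expression, rna_fractions):
    """Subtract the RNA-weighted sum of average TME expression from the total.

    total_expression: total expression level of the gene in the sample.
    avg_tme_expression: cell type -> average expression of the gene in that type.
    rna_fractions: cell type -> fraction of RNA attributed to that cell type
        (assumed here to lie in [0, 1] rather than 0-100).
    """
    weighted_sum = sum(
        rna_fractions[cell_type] * avg_tme_expression[cell_type]
        for cell_type in rna_fractions
    )
    return total_expression - weighted_sum


# Illustrative values only.
estimate = initial_tumor_estimate(
    total_expression=100.0,
    avg_tme_expression={"fibroblasts": 40.0, "macrophages": 20.0},
    rna_fractions={"fibroblasts": 0.5, "macrophages": 0.25},
)
print(estimate)  # 75.0
```

In this sketch the result serves only as one feature among others; it is not the final tumor expression level.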
Some embodiments further comprise obtaining, using the expression data, a first RNA percentage for the tumor cells, wherein the first RNA percentage indicates a percent of RNA associated with the first gene and originating from the tumor cells of the biological sample.
In some embodiments, determining the first tumor expression level for the first gene in the tumor cells further comprises: subtracting the TME expression level estimate from the total expression level for the first gene; and dividing a result of the subtracting by the first RNA percentage.
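As a hedged sketch of the arithmetic in this embodiment: the TME expression level estimate is subtracted from the total expression level, and the difference is divided by the RNA percentage for the tumor cells. The function name is hypothetical, and the RNA percentage is assumed to be supplied as a fraction in (0, 1].

```python
def tumor_expression_level(total_expression, tme_estimate, tumor_rna_fraction):
    """Return (total - TME estimate) / tumor RNA fraction.

    tumor_rna_fraction is assumed to be a fraction in (0, 1]; a percentage
    on a 0-100 scale would need to be divided by 100 first.
    """
    if not 0.0 < tumor_rna_fraction <= 1.0:
        raise ValueError("tumor_rna_fraction must be in (0, 1]")
    return (total_expression - tme_estimate) / tumor_rna_fraction


# Illustrative values only.
level = tumor_expression_level(
    total_expression=100.0, tme_estimate=25.0, tumor_rna_fraction=0.25
)
print(level)  # 300.0
```

The division rescales the tumor-derived portion of the signal to the tumor cell population, which is why a small tumor fraction yields a larger per-population level.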
In some embodiments, the expression data has been previously obtained at least in part by sequencing the biological sample of the subject having cancer.
In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 25 genes in the first plurality of genes associated with the tumor cells. In some embodiments, the plurality of machine learning models comprises at least 25 machine learning models corresponding to the at least 25 genes.
In some embodiments, each machine learning model of the at least 25 machine learning models comprises a different gradient boosted model.
In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 10 genes selected from genes listed in Table 1. In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 25 genes selected from genes listed in Table 1. In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 50 genes selected from genes listed in Table 1. In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 75 genes selected from genes listed in Table 1.
In some embodiments, the first machine learning model of the plurality of machine learning models is a gradient boosted model.
Some embodiments further comprise training the first machine learning model by: obtaining training data comprising simulated expression data for genes in the set of genes, wherein the training data is associated with one or more biological samples; generating, using the training data, a training set of features for the first gene; training the first machine learning model to estimate a TME expression level of the first gene, the training comprising: providing the training set of features as input to the first machine learning model to obtain an output comprising an estimate of the TME expression level of the first gene in the TME cells of the one or more biological samples; and updating parameters of the first machine learning model using the estimate of the TME expression level.
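The training loop described above (obtain an estimate, then update the model using that estimate) can be illustrated with a toy gradient boosted regressor built from one-split decision stumps. This is a teaching sketch in pure Python, not the inventors' implementation; a practical system would more likely use an established gradient boosting library. All names, the tiny data set, and the hyperparameters are assumptions. Each row of `rows` stands in for a training set of features and each entry of `targets` for the known TME expression level in the simulated data.

```python
def fit_stump(rows, residuals):
    """Return (feature index, threshold, left value, right value) with least squared error."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), 0, float("inf"), mean, mean)  # fallback if no split exists
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [r for row, r in zip(rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = sum((r - left_mean) ** 2 for r in left) + sum(
                (r - right_mean) ** 2 for r in right
            )
            if error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    return best[1:]


def predict(model, row):
    """Sum the base value and the learning-rate-scaled output of every stump."""
    base, learning_rate, stumps = model
    value = base
    for j, threshold, left_value, right_value in stumps:
        value += learning_rate * (left_value if row[j] <= threshold else right_value)
    return value


def train_boosted_model(rows, targets, n_rounds=50, learning_rate=0.1):
    """Fit each new stump to the residuals of the current estimates."""
    model = (sum(targets) / len(targets), learning_rate, [])
    for _ in range(n_rounds):
        estimates = [predict(model, row) for row in rows]          # obtain output
        residuals = [t - e for t, e in zip(targets, estimates)]    # compare to target
        model[2].append(fit_stump(rows, residuals))                # update parameters
    return model


def mean_squared_error(model, rows, targets):
    return sum((predict(model, r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


# Illustrative training data only.
rows = [[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 4.0]]
targets = [2.0, 4.0, 6.0, 8.0]
untrained = (sum(targets) / len(targets), 0.1, [])
trained = train_boosted_model(rows, targets)
mse_before = mean_squared_error(untrained, rows, targets)
mse_after = mean_squared_error(trained, rows, targets)
print(mse_after < mse_before)  # True
```

Under this scheme, one such model would be trained per gene, each on features and targets generated for that gene.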
In some embodiments, generating the training set of features for the first gene comprises: obtaining, using the simulated expression data, an initial expression level estimate of the first gene in tumor cells of the one or more biological samples and including the initial expression level estimate in the training set of features; and including at least some of the simulated expression levels in the training set of features.
In some embodiments, the first machine learning model was trained at least in part by generating training data comprising simulated expression data, wherein generating the training data comprises: obtaining training expression data for each of one or more biological samples, the training expression data comprising first training expression levels for the first plurality of genes and second training expression levels for the second plurality of genes; generating first simulated expression data using the first training expression levels; generating second simulated expression data using the second training expression levels; and combining the first simulated expression data and the second simulated expression data to produce at least part of the simulated expression data.
Some embodiments further comprise identifying at least one anti-cancer therapy for the subject based on the first tumor expression level for the first gene in the tumor cells.
Some embodiments further comprise administering the at least one anti-cancer therapy.
In some embodiments, the at least one anti-cancer therapy is selected from the group of therapies for the first gene listed in Table 3.
In some embodiments, identifying the at least one anti-cancer therapy for the subject comprises: determining whether the first tumor expression level satisfies at least one criterion associated with the first gene; and after determining that the first tumor expression level satisfies the at least one criterion, selecting the at least one anti-cancer therapy from the group of therapies listed for the first gene in Table 3.
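The two-step selection described above (check a criterion, then look up therapies for the gene) might be sketched as below. The criterion is assumed here to be a simple minimum threshold, which is only one possible form of "at least one criterion". The gene label, the threshold, and the therapy names are placeholders and are not taken from Table 3.

```python
def select_therapies(gene, tumor_expression, thresholds, therapies_by_gene):
    """Return the therapies listed for `gene` if its tumor expression meets the threshold.

    thresholds: gene -> minimum tumor expression level (hypothetical criterion).
    therapies_by_gene: gene -> list of therapy names (stand-in for a lookup table).
    """
    threshold = thresholds.get(gene)
    if threshold is None or tumor_expression < threshold:
        return []
    return list(therapies_by_gene.get(gene, []))


# Placeholder data for illustration only.
thresholds = {"GENE_A": 50.0}
therapies_by_gene = {"GENE_A": ["therapy_1", "therapy_2"]}

print(select_therapies("GENE_A", 80.0, thresholds, therapies_by_gene))
print(select_therapies("GENE_A", 10.0, thresholds, therapies_by_gene))
```

Returning an empty list when the criterion is not satisfied keeps the lookup from being consulted before the check has passed, mirroring the ordering in the text.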
The inventors have developed machine learning techniques for estimating expression levels of genes in tumor cells (which may be referred to herein as “tumor expression levels”) in a biological sample (e.g., such as a sample from a tumor or other diseased tissue) based on expression data (e.g., data obtained, in part, by sequencing the biological sample, for example, using bulk RNA-sequencing). In some embodiments, the techniques involve using multiple machine learning models to estimate respective expression levels of the genes in the tumor microenvironment (TME) cells (which may be referred to herein as “TME expression levels”) of the biological sample. For example, in some embodiments, a different machine learning model may be used to estimate a respective TME expression level for each gene. In some embodiments, the outputs of the machine learning models may be used to determine respective tumor expression levels for genes in the tumor cells of the biological sample.
The inventors have appreciated that expression of particular genes by tumor cells may be used to inform tumor diagnosis, monitor disease progression, inform treatment decisions, and identify clinically-relevant biomarkers. For example, expression levels of a gene in tumor cells may be used to determine whether the tumor is of a particular type of cancer. For example, over-expression of the insulin-like growth factor 2 (IGF2) gene by tumor cells is a feature of hepatoblastoma. If the expression levels of the IGF2 gene in tumor cells are relatively high (e.g., the IGF2 gene is over-expressed), this may indicate that the tumor is of the hepatoblastoma type. Such information can be used to identify drugs known to effectively treat hepatoblastoma, to inform whether to initiate or adjust therapy, and to inform other clinical decisions related to the care of the patient. Of course, this example use of the expression levels of IGF2 should be employed only when the expression levels of IGF2 may be estimated with sufficient accuracy.
Expression levels of a gene in tumor cells may also be used to identify an effective treatment or therapy for the tumor. For example, expression of the CDK2 (cyclin dependent kinase 2) gene by tumor cells has been shown to permit immortalization of tumor cells. Due to this functionality, the CDK2 gene has been identified as a target for mechanism-based therapeutic strategies in cancer treatment. Therefore, if a patient's tumor cells are shown to express the CDK2 gene, this may indicate that the mechanism-based therapeutic strategies will effectively treat the tumor, and such therapeutic strategies may be administered to the patient.
The inventors have further recognized and appreciated that bulk sequencing, which can provide information about tens of thousands of genes in a biological sample simultaneously, can allow for the detection of a signal that represents the combined contribution of multiple cell types, including tumor cells and tumor microenvironment cells. However, the inventors have recognized that total expression data of this kind does not yield information regarding the origin of individual RNA or DNA molecules, such that there remains a significant challenge with estimating the expression level of a gene in tumor cells when that same gene is also simultaneously expressed by one or more types of TME cells. For example, PTK7 (protein tyrosine kinase 7), CCND2 (Cyclin D2), CDK2, and IGF2 are just a few of the many genes that can be simultaneously expressed by both tumor and TME cells. Since the tumor expression of a gene can inform important decisions relating to diagnosis, prognosis, and treatment of the tumor, the inventors have recognized and appreciated that it is critical to distinguish between tumor and TME expression of genes.
Additionally, the inventors have recognized and appreciated that tumor cells may make up only a relatively small percentage of complex tumor tissue as a whole, with percentages sometimes below 10%. Measuring expression of small cell populations from bulk RNA-seq data can be especially challenging because of the reduced signal-to-noise ratio, if one were to consider expression levels of tumor cells as the “signal” and expression levels of TME cells as the “noise.” Moreover, because TME cellular transcripts may comprise the majority of the total transcripts in the tumor, this may lead to biases during clinical decision-making and biomarker development.
Various techniques have been employed in an attempt to estimate tumor expression of genes in a biological sample. However, such techniques have limitations and do not adequately address the above-identified issues associated with tumor expression estimation. In particular, conventional techniques involve: (a) predicting the TME expression of a gene in a biological sample based on average TME expression levels of the gene across multiple samples; and (b) subtracting the TME expression of the gene from the total expression of the gene to estimate the tumor expression of the gene. Conventional techniques for predicting the TME expression of the gene involve obtaining the average expression levels of the gene in different TME cell populations and scaling the average expression levels by a respective fraction of each of the TME cell populations. However, using average expression levels of a gene introduces inaccuracies into the predicted TME and tumor expression levels of the gene because the average levels, by definition, are not particular to an individual tumor sample; they are obtained as averages of data collected from sequencing multiple diverse samples. Cells (e.g., tumor and TME cells), however, react to different environments, meaning their gene expression levels differ based on their surrounding environment. Accordingly, the average expression levels of a gene do not accurately reflect the tumor and TME expression levels of that gene in a particular tumor sample for a particular patient.
Due to the limitations in their accuracy, the output of conventional techniques cannot be used to reliably inform clinical decision making or to identify clinically-relevant biomarkers. For example, because of their reliance on average expression levels of individual genes, conventional techniques will underestimate the TME expression level of a gene that is uniquely and highly expressed in TME cells of a particular tumor. Instead, the conventional techniques will inaccurately attribute this expression to tumor cells in the tumor. This could lead to, among other problems, inaccurate diagnosis, selection and administration of an ineffective treatment, and inaccurate identification of the gene as a clinically-relevant biomarker.
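A small numeric illustration of this misattribution follows. All numbers are invented for the example, and the mixing model (the total is the tumor contribution plus the TME RNA fraction times the TME expression) is a simplifying assumption rather than a statement of how any particular conventional tool works.

```python
# Hypothetical sample in which the TME expresses the gene far above average.
tme_rna_fraction = 0.75
average_tme_expression = 8.0      # average over many unrelated samples
actual_tme_expression = 48.0      # expression in this particular tumor's TME
actual_tumor_contribution = 1.0   # tumor cells contribute very little

# What bulk sequencing would observe under the assumed mixing model.
total = actual_tumor_contribution + tme_rna_fraction * actual_tme_expression

# Conventional approach: scale the average TME level, then subtract from the total.
conventional_tme = tme_rna_fraction * average_tme_expression
conventional_tumor = total - conventional_tme

print(total)               # 37.0
print(conventional_tme)    # 6.0
print(conventional_tumor)  # 31.0
```

Here the average-based TME estimate (6.0) falls well below the actual TME contribution (36.0), so the remainder attributed to tumor cells (31.0) greatly exceeds their actual contribution (1.0).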
To address the drawbacks of conventional techniques of tumor expression estimation, the inventors have developed machine learning techniques that account for the unique expression of a particular tumor. In particular, the inventors have developed systems and methods for using machine learning to estimate tumor expression levels of genes in tumor cells in a biological sample of a subject having cancer. The developed techniques include: (a) obtaining expression data (e.g., RNA and/or DNA expression data) for genes associated with tumor cells (e.g., genes listed in Table 1) and for genes associated with TME cells (e.g., genes listed in Table 2); and (b) determining tumor expression levels for the genes associated with tumor cells using multiple machine learning models, each of which corresponds to a gene associated with tumor cells. In some embodiments, determining a tumor expression level for a particular gene associated with tumor cells involves generating a set of features for the particular gene, providing the set of features as input to a respective machine learning model (e.g., a machine learning model trained to estimate a TME expression level of the particular gene) to obtain a TME expression level estimate of the particular gene, and determining the tumor expression level for the particular gene using the TME expression level estimate and a total expression level of the gene. In some embodiments, the determined tumor expression level of the gene may be used to identify a recommended appropriate anti-cancer therapy for the subject, which therapy may then be administered.
In some embodiments, the machine learning techniques used for determining tumor expression levels include using multiple machine learning models, each trained to determine a tumor expression level for a particular respective gene. In some embodiments, each machine learning model may have multiple parameters (e.g., at least 10) and training the machine learning model may include estimating values of those parameters computationally from training data. The training data may, in some embodiments, include real expression data obtained from sequencing samples and/or simulated expression data obtained by synthesizing these data for purposes of training using the techniques described herein. In some embodiments, generating the simulated expression data may include generating many training sets (e.g., at least 25,000, at least 50,000, at least 100,000, at least 150,000, at least 200,000, at least 500,000, etc.) for each machine learning model associated with a respective gene.
In some embodiments, the techniques developed by the inventors and described herein may be used in conjunction with (e.g., onboard) one or more sequencing platforms to immediately process the data being generated by the sequencing platforms. As a result, the data provided by the sequencing platform include accurate estimates of expression levels of genes in tumor cells and in their microenvironment. As such, the techniques described herein constitute an improvement to bioinformatics generally and, more specifically, to supporting clinical decision making and understanding tumor pathogenesis, because the techniques described herein provide improved methods for determining tumor expression levels of genes in tumor cells of a biological sample.
Furthermore, unlike conventional techniques, the techniques described herein account for gene expression that is particular to the biological sample by using expression data, obtained by sequencing the biological sample, as input to a machine learning model trained to estimate the tumor expression level for the particular gene. By accounting for gene expression that is particular to the biological sample, as opposed to relying solely on the average gene expression level from multiple, unrelated biological samples, the techniques determine the tumor expression level for the particular gene with greater accuracy.
Another advantage of the techniques developed by the inventors is that, in some embodiments, the models described herein have been trained with data representing artificial mixtures of cell types, allowing the training process to take into account the diverse and tissue-specific expression of tumor and TME cells across much larger numbers of samples of diverse composition (e.g., simulating a wide variety of tumor microenvironments) than could be practically possible by physically sampling and analyzing tumor samples. This substantially reduces the effort and computational resources associated with training the machine learning models for expression level estimation. The artificial mixes described herein can also be obtained in such a way that they capture a wide biological variability, improving the ability of a machine learning model trained using this data to identify biologically meaningful signals in the presence of such noise and variability. For example, as described herein, a quantitative noise model for technical noise was developed and may be applied to artificial mixes, in some embodiments. Moreover, the RNA expression data used to develop these artificial mixes was derived from multiple different samples, across multiple cell populations having a variety of biological states. These artificial mixes improve the ability of the machine learning models to effectively determine tumor expression levels for genes in tumor cells across real tumor samples.
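One way to picture an artificial mixture is as a weighted combination of a tumor expression profile and a TME expression profile, with the TME share recorded as a training target. The sketch below assumes a simple linear mixing model and uniformly sampled tumor fractions; both are assumptions made for illustration, the technical noise model mentioned above is omitted, and all names and values are hypothetical.

```python
import random


def make_artificial_mix(tumor_profile, tme_profile, tumor_fraction):
    """Linearly mix two gene -> expression profiles.

    Returns (total, tme_part): the simulated total expression per gene and the
    portion of that total contributed by the TME profile.
    """
    total = {}
    tme_part = {}
    for gene in tumor_profile:
        tme_part[gene] = (1.0 - tumor_fraction) * tme_profile[gene]
        total[gene] = tumor_fraction * tumor_profile[gene] + tme_part[gene]
    return total, tme_part


def simulate_training_set(tumor_profiles, tme_profiles, n_mixes, seed=0):
    """Draw random profile pairs and tumor fractions to build many mixes."""
    rng = random.Random(seed)
    mixes = []
    for _ in range(n_mixes):
        tumor = rng.choice(tumor_profiles)
        tme = rng.choice(tme_profiles)
        fraction = rng.uniform(0.05, 0.95)  # assumed range of tumor content
        mixes.append(make_artificial_mix(tumor, tme, fraction))
    return mixes


# Illustrative profiles only.
mix_total, mix_tme = make_artificial_mix({"IGF2": 80.0}, {"IGF2": 40.0}, 0.25)
print(mix_total["IGF2"], mix_tme["IGF2"])  # 50.0 30.0

training_mixes = simulate_training_set(
    tumor_profiles=[{"IGF2": 80.0}, {"IGF2": 10.0}],
    tme_profiles=[{"IGF2": 40.0}, {"IGF2": 5.0}],
    n_mixes=5,
)
print(len(training_mixes))  # 5
```

Because each mix is built from known components, the TME contribution is available exactly, which is what makes such data usable as labeled training examples.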
Consequently, the techniques developed by the inventors provide for an improved diagnostic tool, which enables more accurate identification of treatments for patients, thereby improving clinical outcomes. In particular, by accurately and reliably determining the tumor expression level of a particular gene, the techniques described herein can be used to identify a treatment most effective for treating patients having that particular tumor expression level of a particular gene. By contrast, conventional techniques fail to reliably estimate tumor expression levels, resulting in unreliable and poor identification of anti-cancer treatments.
In addition to identifying therapies for a subject based on tumor expression levels using the techniques described herein, one or more clinical trials may be identified for the subject using the determined tumor expression levels.
Additionally or alternatively, the techniques described herein may be utilized in the context of quality control processes in the laboratory environment. For example, immunohistochemistry techniques may be used to initially estimate the tumor expression of a gene in tumor cells of a biological sample. However, immunohistochemistry is highly subjective since it relies on user observation of the sample under a microscope. Therefore, different users will estimate different values of tumor expression, leading to inconsistent, unreliable, and often inaccurate results. The techniques described herein may be used to objectively confirm or correct the laboratory results.
Accordingly, some embodiments provide for computer-implemented machine learning techniques for estimating tumor expression levels of genes in tumor cells in a biological sample (e.g., having tumor and TME cells) of a subject having cancer. The techniques include: (a) obtaining expression data for a set of genes, the set of genes comprising a first plurality of genes (e.g., at least one, at least some, or all of the genes shown in Table 1) associated with tumor cells and a second plurality of genes (e.g., at least one, at least some, or all of the genes shown in Table 2) associated with the tumor microenvironment cells, the expression data including first total expression levels for genes in the first plurality of genes (e.g., the combined expression of the genes by all cells in the biological sample) and second total expression levels for genes in the second plurality of genes (e.g., the combined expression of the genes by all cells in the biological sample); (b) determining the tumor expression levels (e.g., the expression levels of genes in tumor cells) of the first plurality of genes in the tumor cells using a plurality of machine learning models, the plurality of machine learning models comprising a respective machine learning model for each gene in the first plurality of genes including a first machine learning model for a first gene in the first plurality of genes, the tumor expression levels including a first tumor expression level for the first gene in the tumor cells; and (c) outputting the tumor expression levels (e.g., storing in memory, displaying in a graphical user interface (GUI), transmitting to one or more devices, etc.) of the first plurality of genes in the tumor cells.
In some embodiments, determining the tumor expression levels of the first plurality of genes includes: (a) generating a first set of features for the first gene; (b) providing the first set of features as input to the first machine learning model to obtain an output indicative of a TME expression level estimate (e.g., expression level of a gene in TME cells) of the first gene in the TME cells; and (c) determining the first tumor expression level for the first gene in the tumor cells using the output of the first machine learning model and a total expression level, in the first total expression levels, for the first gene (e.g., at least in part by subtracting the TME expression level estimate from the total expression level).
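By way of non-limiting illustration, the subtraction described in step (c) may be sketched as follows. The function name is hypothetical, and clamping the result at zero is an assumption of this sketch rather than a requirement of the techniques described herein.

```python
def tumor_expression_level(total_level: float, tme_estimate: float) -> float:
    """Estimate expression in tumor cells from a total level and a TME estimate.

    Assumption (not from the description): a negative difference is clamped
    to zero, since an expression level cannot be negative.
    """
    return max(total_level - tme_estimate, 0.0)
```

For example, a total expression level of 10.0 and a TME expression level estimate of 3.5 would yield a tumor expression level of 6.5 under this sketch.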
In some embodiments, generating the first set of features for the first gene includes: (a) obtaining, using the expression data, an initial expression level estimate of the first gene in the tumor cells of the biological sample and including the initial expression level estimate of the first gene in the first set of features; (b) including at least some of the first total expression levels (e.g., at least 25, at least 50, at least 75, at least 100, at least 150, etc.) in the first set of features; and (c) including at least some of the second total expression levels (e.g., at least 25, at least 50, at least 75, at least 100, at least 150, etc.) in the first set of features.
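By way of non-limiting illustration, one possible way of assembling such a set of features is sketched below. The function name, the gene identifiers, and the ordering of the features are hypothetical and are chosen only for this sketch.

```python
from typing import List, Mapping, Sequence


def build_feature_vector(
    initial_estimate: float,
    total_levels: Mapping[str, float],
    tumor_genes: Sequence[str],
    tme_genes: Sequence[str],
) -> List[float]:
    """Concatenate the initial estimate with selected total expression levels.

    The ordering (initial estimate, then tumor-associated genes, then
    TME-associated genes) is an assumption of this sketch.
    """
    features = [initial_estimate]
    features.extend(total_levels[g] for g in tumor_genes)  # first total expression levels
    features.extend(total_levels[g] for g in tme_genes)    # second total expression levels
    return features
```

The same ordering would need to be used when training a model and when applying it, so that each position in the vector always refers to the same quantity.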
In some embodiments, the plurality of machine learning models includes a second machine learning model for a second gene (e.g., one of the genes listed in Table 1) in the first plurality of genes and the tumor expression levels include a second tumor expression level for the second gene in the tumor cells. For example, the second machine learning model may be different from the first machine learning model and the second gene may be different from the first gene. In some embodiments, determining the tumor expression levels of the first plurality of genes further includes: (a) generating a second set of features for the second gene; (b) providing the second set of features as input to the second machine learning model to obtain an output indicative of a TME expression level estimate of the second gene in the TME cells; and (c) determining the second tumor expression level for the second gene in the tumor cells using the output of the second machine learning model and a total expression level, in the first total expression levels, for the second gene.
In some embodiments, generating the second set of features for the second gene includes: (a) obtaining, using the expression data, an initial expression level estimate of the second gene in the tumor cells of the biological sample and including the initial expression level estimate of the second gene in the second set of features; (b) including at least some of the first total expression levels (e.g., at least 25, at least 50, at least 75, at least 100, at least 150, etc.) in the second set of features; and (c) including at least some of the second total expression levels (e.g., at least 25, at least 50, at least 75, at least 100, at least 150, etc.) in the second set of features.
In some embodiments, the plurality of machine learning models includes a third machine learning model for a third gene (e.g., selected from the genes listed in Table 1) in the first plurality of genes and the tumor expression levels include a third tumor expression level for the third gene in the tumor cells. For example, the third machine learning model may be different from both the first and second machine learning models and the third gene may be different from both the first and second genes. In some embodiments, determining the tumor expression levels of the first plurality of genes further includes: (a) generating a third set of features for the third gene; (b) providing the third set of features as input to the third machine learning model to obtain an output indicative of a TME expression level estimate of the third gene in the TME cells; and (c) determining the third tumor expression level for the third gene in the tumor cells using the output of the third machine learning model and a total expression level, in the first total expression levels, for the third gene.
In some embodiments, generating the first set of features for the first gene further comprises obtaining, using the expression data, a first plurality of RNA percentages (e.g., by cellular deconvolution) for a respective plurality of types of cells that occur in the TME, wherein each of the first plurality of RNA percentages indicates a percent of RNA (e.g., in the biological sample) associated with the first gene (e.g., produced during expression of the first gene) and originating from (e.g., produced by) cells of a respective type (e.g., neutrophils, fibroblasts, etc.) in the biological sample. For example, in some embodiments, obtaining the first plurality of RNA percentages includes processing at least some of the expression data (e.g., a portion or all of the expression data) using at least one non-linear regression model.
In some embodiments, generating the first set of features for the first gene further comprises including at least some of the first plurality of RNA percentages in the first set of features.
In some embodiments, the TME cells comprise TME cells of a first type and TME cells of a second type (e.g., different from the first type). In some embodiments, the at least some of the expression data includes a first subset of the expression data and a second subset (e.g., different from the first subset) of the expression data. In some embodiments, the at least one non-linear regression model includes a first non-linear regression model and a second non-linear regression model different from the first non-linear regression model. In some embodiments, obtaining the first plurality of RNA percentages includes (a) processing the first subset of the expression data using the first non-linear regression model to obtain a first RNA percentage for the TME cells of the first type; and (b) processing the second subset of the expression data using the second non-linear regression model to obtain a second RNA percentage for the TME cells of the second type.
In some embodiments, the first type of TME cells and second type of TME cells are each selected from the group consisting of B cells, CD4+ T cells, CD8+ T cells, endothelial cells, fibroblasts, lymphocytes, macrophages, monocytes, NK cells, and neutrophils, wherein the first type is different from the second type. However, it should be appreciated that the cell type could be any suitable type of TME cell, as aspects of the technology described herein are not limited to any particular type of TME cell.
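By way of non-limiting illustration, applying a separate model to a separate subset of the expression data for each type of TME cell may be sketched as follows. The function names, the gene identifiers, and the saturating stand-in model are hypothetical; any suitable trained non-linear regression model could take the place of the stand-in.

```python
import math
from typing import Callable, Dict, Mapping, Sequence, Tuple

# Each cell type maps to (genes in its expression subset, model for that type).
CellTypeModel = Tuple[Sequence[str], Callable[[Sequence[float]], float]]


def estimate_rna_percentages(
    expression: Mapping[str, float],
    models: Mapping[str, CellTypeModel],
) -> Dict[str, float]:
    """Apply each cell type's own model to that cell type's expression subset."""
    percentages = {}
    for cell_type, (genes, model) in models.items():
        subset = [expression[g] for g in genes]
        percentages[cell_type] = model(subset)
    return percentages


def toy_model(scale: float) -> Callable[[Sequence[float]], float]:
    """Hypothetical stand-in for a trained model: a saturating non-linear
    function of mean expression, returning a value in [0, 100)."""
    def predict(values: Sequence[float]) -> float:
        mean = sum(values) / len(values)
        return 100.0 * (1.0 - math.exp(-mean / scale))
    return predict
```

A call such as `estimate_rna_percentages(expression, {"fibroblasts": (["GENE_F1"], toy_model(10.0)), "neutrophils": (["GENE_N1"], toy_model(5.0))})` would return one RNA percentage per cell type, each produced by a different model from a different subset of the expression data.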
In some embodiments, obtaining the initial expression level estimate of the first gene in the tumor cells of the biological sample includes (a) obtaining an average TME expression level (e.g., obtained based on previously-determined expression levels of the first gene in TME cells of different biological samples) of the first gene for each of the plurality of types of cells that occur in the TME; (b) determining a weighted sum of the obtained expression levels based on the first plurality of RNA percentages (e.g., by multiplying the first plurality of RNA percentages with respective average expression levels); and (c) subtracting the weighted sum from the total expression level for the first gene to obtain the initial expression level estimate.
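By way of non-limiting illustration, steps (a) through (c) may be sketched as follows. The function name is hypothetical, and the sketch assumes that each RNA percentage is expressed on a 0 to 100 scale and is converted to a fraction before being used as a weight.

```python
from typing import Mapping


def initial_tumor_estimate(
    total_level: float,
    rna_percentages: Mapping[str, float],
    avg_tme_levels: Mapping[str, float],
) -> float:
    """Subtract the percentage-weighted average TME expression from the total.

    Assumption: percentages are on a 0-100 scale, so each is divided by 100
    before multiplying the corresponding average expression level.
    """
    weighted_sum = sum(
        (rna_percentages[cell_type] / 100.0) * avg_tme_levels[cell_type]
        for cell_type in rna_percentages
    )
    return total_level - weighted_sum
```

For example, with a total expression level of 100, RNA percentages of 20 for fibroblasts and 10 for macrophages, and average TME expression levels of 50 and 100, respectively, the weighted sum is 20 and the initial expression level estimate is 80.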
In some embodiments, the techniques further include obtaining, using the expression data, a first RNA percentage for the tumor cells, wherein the first RNA percentage indicates a percent of RNA associated with the first gene and originating from the tumor cells of the biological sample. For example, the first RNA percentage may be obtained using the techniques for obtaining RNA percentages for the types of cells that occur in the TME.
In some embodiments, the expression data has been previously obtained at least in part by sequencing (e.g., RNA or DNA sequencing) the biological sample of the subject having cancer.
In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 25 genes, at least 50 genes, at least 75 genes, at least 100 genes, or at least 150 genes in the first plurality of genes associated with tumor cells. In some embodiments, the plurality of machine learning models comprises at least 25 machine learning models, at least 50 machine learning models, at least 75 machine learning models, at least 100 machine learning models, or at least 150 machine learning models corresponding to the at least 25 genes, at least 50 genes, at least 75 genes, at least 100 genes, or at least 150 genes, respectively.
In some embodiments, each machine learning model of the at least 25 machine learning models (at least 50 machine learning models, at least 75 machine learning models, at least 100 machine learning models, or at least 150 machine learning models, etc.) comprises a different gradient boosted model.
In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 10 genes selected from genes listed in Table 1. In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 25 genes selected from genes listed in Table 1. In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 50 genes selected from genes listed in Table 1. In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 75 genes selected from genes listed in Table 1. In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 100 genes selected from genes listed in Table 1. In some embodiments, the at least some of the first total expression levels included in the first set of features include total expression levels for at least 150 genes selected from genes listed in Table 1.
In some embodiments, the first machine learning model of the plurality of machine learning models is a gradient boosted model (e.g., trained using a gradient boosting framework such as LightGBM, Catboost, XGBoost, Adaboost, etc.).
In some embodiments, the techniques further include training the first machine learning model by (a) obtaining training data comprising simulated expression data for genes in the set of genes, wherein the training data is associated with one or more biological samples (e.g., tumor and/or non-tumor samples obtained from one or more subjects); (b) generating, using the training data, a training set of features for the first gene; and (c) training the first machine learning model to estimate a TME expression level of the first gene. In some embodiments, the training includes providing the training set of features as input to the first machine learning model to obtain an output comprising an estimate of the TME expression level of the first gene in the TME cells of the one or more biological samples and updating parameters of the first machine learning model using the estimate of the TME expression level.
In some embodiments, generating the training set of features for the first gene includes obtaining, using the simulated expression data, an initial expression level estimate of the first gene in tumor cells of the one or more biological samples and including the initial expression level estimate in the training set of features and including at least some of the simulated expression levels in the training set of features (e.g., at least some expression levels of genes associated with tumor cells and at least some expression levels of genes associated with TME cells).
In some embodiments, the first machine learning model was trained at least in part by generating training data comprising simulated expression data. In some embodiments, generating the training data includes (a) obtaining training expression data for each of one or more biological samples, the training expression data comprising first training expression levels for the first plurality of genes (e.g., associated with tumor cells) and second training expression levels for the second plurality of genes (e.g., associated with TME cells); (b) generating first simulated expression data using the first training expression levels; (c) generating second simulated expression data using the second training expression levels; and (d) combining the first simulated expression data and the second simulated expression data to produce at least part of the simulated expression data.
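By way of non-limiting illustration, steps (b) through (d) may be sketched as follows. The manner in which simulated expression data is generated is not specified above; this sketch assumes, purely for illustration, that each training expression level is scaled by a random factor and that the two gene sets do not overlap. All names are hypothetical.

```python
import random
from typing import Dict, Mapping


def perturb(levels: Mapping[str, float], jitter: float, rng: random.Random) -> Dict[str, float]:
    """Scale each level by a random factor in [1 - jitter, 1 + jitter].

    Multiplicative jitter is an assumption of this sketch, not a technique
    taken from the description.
    """
    return {g: v * rng.uniform(1.0 - jitter, 1.0 + jitter) for g, v in levels.items()}


def simulate_sample(
    tumor_levels: Mapping[str, float],
    tme_levels: Mapping[str, float],
    jitter: float = 0.1,
    seed: int = 0,
) -> Dict[str, float]:
    """Generate first and second simulated data, then combine them."""
    rng = random.Random(seed)
    simulated = perturb(tumor_levels, jitter, rng)    # first simulated expression data
    simulated.update(perturb(tme_levels, jitter, rng))  # second, combined with the first
    return simulated
```

Fixing the seed makes a simulated sample reproducible, which may be convenient when the same training data must be regenerated.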
In some embodiments, the techniques further include identifying at least one anti-cancer therapy for the subject based on the first tumor expression level for the first gene in the tumor cells. For example, an anti-cancer therapy may be identified for the subject if the first tumor expression level satisfies some criteria (e.g., falls within a range of expression levels, exceeds a threshold expression level, is lower than a threshold expression level, etc.). In some embodiments, the techniques further comprise administering the at least one anti-cancer therapy.
In some embodiments, the at least one anti-cancer therapy is selected from the group of therapies for the first gene listed in Table 3.
In some embodiments, identifying the at least one anti-cancer therapy includes determining whether the first tumor expression level satisfies at least one criterion associated with the first gene and after determining that the first tumor expression level satisfies the at least one criterion, selecting the at least one anti-cancer therapy from the group of therapies listed for the first gene in Table 3. For example, the at least one criterion may be particular to the first gene.
Following below are more detailed descriptions of various concepts related to, and embodiments of, the cellular deconvolution systems and methods developed by the inventors. It should be appreciated that various aspects described herein may be implemented in any of numerous ways. Examples of specific implementations are provided herein for illustrative purposes only. In addition, the various aspects described in the embodiments below may be used alone or in any combination and are not limited to the combinations explicitly described herein.
In some embodiments, the illustrative technique 100 may be implemented in a clinical or laboratory setting. For example, the technique 100 may be implemented on a computing device 104 that is located within the clinical or laboratory setting. In some embodiments, the computing device 104 may directly obtain the expression data 103 from a sequencing platform 102 located within the clinical or laboratory setting. For example, a computing device 104 included in the sequencing platform 102 may directly obtain the expression data 103 via a communication network, such as the Internet or any other suitable network, as aspects of the technology described herein are not limited to any particular communication network.
Additionally or alternatively, the illustrative technique 100 may be implemented in a setting that is remote from a clinical or laboratory setting. For example, the illustrative technique 100 may be implemented on a computing device 104 that is located externally from a clinical or laboratory setting. In this case, the computing device may indirectly obtain expression data 103 that is generated using a sequencing platform 102 located within or external to a clinical or laboratory setting. For example, the expression data 103 may be provided to computing device 104 via a communication network, such as the Internet or any other suitable network, as aspects of the technology described herein are not limited to any particular communication network.
As shown in
In some embodiments, the sequencing platform 102 may be a next generation sequencing platform (e.g., Illumina™, Roche™, Ion Torrent™, etc.), or any high-throughput or massively parallel sequencing platform. In some embodiments, the sequencing platform 102 may include any suitable sequencing device and/or any sequencing system including one or more devices. In some embodiments, the sequencing methods may be automated; in other embodiments, there may be manual intervention. In some embodiments, the expression data 103 may be obtained using techniques other than next generation sequencing (e.g., Sanger sequencing, microarrays, etc.).
Expression data 103 may include the sequence data generated by a sequencing protocol (e.g., the series of nucleotides in a nucleic acid molecule identified by next-generation sequencing, Sanger sequencing, etc.) as well as information contained therein (e.g., information indicative of source, tissue type, etc.) which may also be considered information that can be inferred or determined from the sequence data. In some embodiments, expression data 103 may include information included in a FASTA file, a description and/or quality scores included in a FASTQ file, an aligned position included in a BAM file, and/or any other suitable information.
The expression data 103 may be generated by sequencing biological sample 101. Biological sample 101 may include nucleic acid. A nucleic acid may include one or multiple nucleic acid molecules.
In some embodiments, the nucleic acid is RNA. In some embodiments, sequenced RNA comprises both coding and non-coding transcribed RNA found in a sample. When such RNA is used for sequencing the sequencing is said to be generated from “total RNA” and also can be referred to as whole transcriptome sequencing. Alternatively, the nucleic acids can be prepared such that the coding RNA (e.g., mRNA) is isolated and used for sequencing. This can be done through any means known in the art, for example by isolating or screening the RNA for polyadenylated sequences. This is sometimes referred to as mRNA-Seq.
In some embodiments, the nucleic acid is DNA. In some embodiments, the nucleic acid is prepared such that the whole genome is present in the nucleic acid. In some embodiments, the nucleic acid is processed such that only the protein coding regions of the genome remain (e.g., the exome). When nucleic acids are prepared such that only the exome is sequenced, it is referred to as whole exome sequencing (WES). A variety of methods are known in the art to isolate the exome for sequencing, for example, solution-based isolation wherein tagged probes are used to hybridize the targeted regions (e.g., exons) which can then be further separated from the other regions (e.g., unbound oligonucleotides). These tagged fragments can then be prepared and sequenced.
In some embodiments, expression data 103 may include raw DNA or RNA sequence data, DNA exome sequence data (e.g., from whole exome sequencing (WES)), DNA genome sequence data (e.g., from whole genome sequencing (WGS)), RNA expression data, gene expression data, bias-corrected gene expression data, or any other suitable type of sequence data comprising data obtained from the sequencing platform 102 and/or comprising data derived from data obtained from sequencing platform 102. In some embodiments, the origin or preparation of the expression data 103 may include any of the embodiments described with respect to the “Expression Data” and “Obtaining Expression Data” sections.
In some embodiments, the expression data 103 includes gene expression levels. Gene expression levels may be detected by detecting a product of gene expression such as mRNA and/or protein. In some embodiments, gene expression levels are determined by detecting a level of an mRNA in a sample. As used herein, the terms “determining” or “detecting” may include assessing the presence, absence, quantity and/or amount (which can be an effective amount) of a substance within a sample, including the derivation of qualitative or quantitative concentration levels of such substances, or otherwise evaluating the values and/or categorization of such substances in a sample from a subject. Example techniques for processing sequencing data to obtain expression data, including expression levels, are described herein including at least with respect to
In some embodiments, the gene expression levels include total expression levels. As referred to herein, the “total expression level” for a gene is a numeric value quantifying the degree to which the gene is expressed in the biological sample 101. The total expression level for a gene may reflect the combined expression of the gene in both tumor and TME cells of the biological sample. As such, the total expression level for a particular gene may not distinguish between the expression of that particular gene in tumor cells and the expression of that particular gene in TME cells.
In some embodiments, a total expression level is obtained for each of multiple genes. For example, total expression levels may be obtained for at least 10 genes, at least 25 genes, at least 50 genes, at least 75 genes, at least 100 genes, at least 150 genes, at least 200 genes, at least 250 genes, at least 300 genes, at least 350 genes, at least 400 genes, at least 450 genes, at least 500 genes, at least 550 genes, at least 600 genes, or more genes.
In some embodiments, the genes include genes associated with tumor cells and genes associated with TME cells. In some embodiments, genes “associated with tumor cells” include those that are predominantly expressed in tumor cells. Nonlimiting examples of genes associated with the tumor cells include those listed in Table 1. In some embodiments, genes “associated with TME cells” include those that are predominantly expressed in TME cells. Nonlimiting examples of genes associated with TME cells include those listed in Table 2.
In some embodiments, the expression data 103 includes total expression levels for at least some of the genes associated with tumor cells and at least some of the genes associated with TME cells. For example, expression data 103 may include total expression levels for at least 10, at least 25, at least 30, at least 40, at least 50, at least 60, at least 75, at least 100, at least 150, or more genes associated with tumor cells. The genes may be selected, for example, from those listed in Table 1. Additionally or alternatively, expression data 103 may include total expression levels for at least 10, at least 25, at least 30, at least 40, at least 50, at least 60, at least 75, at least 100, at least 150, at least 200, at least 250, at least 300, at least 350, at least 400, or more genes associated with TME cells. The genes may be selected, for example, from those listed in Table 2.
Regardless of the type of expression data 103 obtained, the expression data 103 is processed using computing device 104. The computing device 104 can be one or multiple computing devices of any suitable type. For example, the computing device 104 may be a portable computing device (e.g., a laptop, a smartphone) or a fixed computing device (e.g., a desktop computer, a server). When computing device 104 includes multiple computing devices, the devices may be physically co-located (e.g., in a single room) or distributed across multiple physical locations. In some embodiments, the computing device 104 may be part of a cloud computing infrastructure. In some embodiments, one or more computing device(s) 104 may be co-located in a facility operated by an entity (e.g., a hospital, a research institution). In some embodiments, the one or more computing device(s) 104 may be physically co-located with a medical device, such as a sequencing platform 102. For example, a sequencing platform 102 may include computing device 104.
In some embodiments, the computing device 104 may be operated by a user such as a doctor, clinician, researcher, patient, or other individual. For example, the user may provide the expression data 103 as input to the computing device 104 (e.g., by uploading a file), and/or may provide user input specifying processing or other methods to be performed using the expression data 103.
In some embodiments, expression data 103 may be processed by one or more software programs running on computing device 104 (e.g., as described herein including at least with respect to
In some embodiments, each of the plurality of machine learning models is of any suitable type. For example, each of the machine learning models may be a gradient boosted machine learning model (e.g., a first gradient boosted machine learning model, a second gradient boosted machine learning model, etc.). The gradient boosted machine learning model may be a gradient boosted decision tree model or may use any other suitable type of model as a “weak learner” boosted via gradient boosting or any other suitable boosting approach. In some embodiments, the gradient boosted ML model may be trained using a gradient boosting framework such as XGBoost, LightGBM, Catboost, or Adaboost.
It should be appreciated that a machine learning model of the plurality of machine learning models need not be a gradient boosted machine learning model and that other types of machine learning models may be used. For example, in some embodiments, a machine learning model may be a non-linear regression model (e.g., a logistic regression model), a neural network model, a support vector machine, a Gaussian mixture model, a random forest model, a decision tree model, or any other suitable type of machine learning model, as aspects of the technology described herein are not limited in this respect.
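By way of non-limiting illustration, the idea of boosting a “weak learner” via gradient boosting may be sketched in plain Python as follows. This sketch uses one-split decision stumps and a squared-error loss, for which the negative gradient is simply the residual. It is a teaching stand-in with hypothetical function names, not the implementation used by any of the frameworks named above.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, float, float]  # feature index, threshold, left value, right value


def fit_stump(X: Sequence[Sequence[float]], residuals: Sequence[float]) -> Stump:
    """Find the single split that minimizes squared error on the residuals."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if err < best_err:
                best, best_err = (j, t, lv, rv), err
    if best is None:  # no valid split (e.g., all rows identical): use a constant
        mean = sum(residuals) / len(residuals)
        best = (0, float("inf"), mean, mean)
    return best


def stump_predict(stump: Stump, row: Sequence[float]) -> float:
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def fit_gradient_boosting(X, y, n_rounds: int = 50, learning_rate: float = 0.1):
    """Fit stumps sequentially, each to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # negative gradient of squared error
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row) for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def predict(model, row: Sequence[float]) -> float:
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)
```

In the context described herein, each row of `X` would be a set of features for a gene and each entry of `y` would be the corresponding TME expression level used as the training target, with one such model fitted per gene.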
In some embodiments, a machine learning model is trained to estimate a TME expression level of a gene associated with tumor cells. As referred to herein, the “TME expression level” of a gene is a numeric value quantifying the degree to which the gene is expressed in TME cells of a biological sample. For example, a first machine learning model may be trained to estimate a TME expression level of a first gene in the biological sample 101 and a second machine learning model may be trained to estimate a TME expression level of a second gene in the biological sample 101. Illustrative techniques for processing the expression data to estimate TME expression levels are described herein, including at least with respect to act 224 of process 220, shown in
Based on the outputs of the machine learning models, including the output of the first machine learning model, in some embodiments, tumor expression level(s) 105 are determined for at least one of the genes associated with tumor cells. For example, the tumor expression level(s) 105 may include a first tumor expression level for a first gene associated with tumor cells. As referred to herein, the “tumor expression level” of a gene is a numeric value quantifying the degree to which the gene is expressed in tumor cells of a biological sample. Illustrative techniques for processing the expression data to estimate tumor expression levels are described herein, including at least with respect to act 226 of process 220, shown in
In some embodiments, the tumor expression level(s) 105 may be provided as output. For example, the tumor expression level(s) 105 may be used to generate a report to be output to a user (e.g., via a graphical user interface (GUI)).
In some embodiments, the tumor expression level(s) 105 may be used to identify a tumor-specific treatment for the subject from which the biological sample 101 was obtained. For example, the expression of a gene may be associated with at least one treatment known to be effective in treating tumors that express that gene (e.g., at a particular expression level). Such a treatment may be identified for the subject based on the biological sample 101 and, in some embodiments, subsequently administered to the subject. For example, Table 3 lists treatments associated respectively with the expression of particular genes associated with tumor cells.
Additionally or alternatively, the tumor expression level(s) 105 may be used to confirm tumor expression levels previously estimated for the biological sample 101. For example, immunohistochemistry results may be received from a lab or a clinical setting. The illustrative technique 100 may include comparing the immunohistochemistry results to the tumor expression level(s) 105 determined for the biological sample 101. If the expression levels do not match, this may indicate that the biological sample 101 used to obtain the tumor expression level(s) 105 is not reliable or that the immunohistochemistry results are not reliable. Therefore, discrepancies between the obtained expression levels can be used to identify issues of quality control, which may be reported back to the appropriate lab or clinical setting.
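By way of non-limiting illustration, such a comparison may be sketched as follows. The sketch assumes that the immunohistochemistry results and the determined tumor expression levels have already been placed on a common numeric scale and that a tolerance has been chosen. The function name, the scale, and the tolerance are hypothetical.

```python
from typing import List, Mapping


def flag_discrepancies(
    ihc_levels: Mapping[str, float],
    estimated_levels: Mapping[str, float],
    tolerance: float,
) -> List[str]:
    """Return genes whose two estimates differ by more than the tolerance.

    Assumption: both inputs are on the same numeric scale. Genes present in
    only one of the inputs are ignored.
    """
    return sorted(
        gene
        for gene in ihc_levels
        if gene in estimated_levels
        and abs(ihc_levels[gene] - estimated_levels[gene]) > tolerance
    )
```

The returned list could then be included in a report to the lab or clinical setting that supplied the immunohistochemistry results.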
In the embodiment of
In some embodiments, the set of genes includes genes associated with tumor cells, and the expression data includes total expression levels for the genes associated with tumor cells. In some embodiments, the set of genes includes at least 10, at least 25, at least 30, at least 40, at least 50, at least 60, at least 75, at least 100, at least 150, or more genes associated with tumor cells. For example, the set of genes may include a subset (e.g., at least some or all) of the genes listed in Table 1, and the expression data may include total expression levels for those genes.
In some embodiments, the set of genes also includes genes associated with TME cells, and the expression data includes total expression levels for the genes associated with TME cells. In some embodiments, the set of genes includes at least 10, at least 25, at least 30, at least 40, at least 50, at least 60, at least 75, at least 100, at least 150, at least 200, at least 250, at least 300, at least 350, at least 400, or more genes associated with TME cells. For example, the set of genes may include a subset (e.g., at least some or all) of the genes listed in Table 2, and the expression data may include total expression levels for those genes.
In some embodiments, the expression data is obtained using any suitable techniques from any suitable location such as, for example, a data store (e.g., expression data store 446 of
Process 200 then proceeds to act 204, where tumor expression levels of genes associated with tumor cells are determined. In some embodiments, determining a tumor expression level for the genes includes using machine learning models corresponding, respectively, to the genes associated with tumor cells. For example, determining a first tumor expression level for a first gene includes using a first machine learning model corresponding to the first gene.
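By way of non-limiting illustration, the correspondence between genes and their respective machine learning models may be sketched as a mapping from gene to model, as follows. The function name and gene identifiers are hypothetical, and each model is represented simply as a callable that returns a TME expression level estimate from a set of features.

```python
from typing import Callable, Dict, Mapping, Sequence

Model = Callable[[Sequence[float]], float]


def estimate_tumor_levels(
    total_levels: Mapping[str, float],
    feature_sets: Mapping[str, Sequence[float]],
    models: Mapping[str, Model],
) -> Dict[str, float]:
    """Apply each gene's own model, then subtract its TME estimate from the total."""
    tumor_levels = {}
    for gene, model in models.items():
        tme_estimate = model(feature_sets[gene])  # model specific to this gene
        tumor_levels[gene] = total_levels[gene] - tme_estimate
    return tumor_levels
```

Because each gene has its own entry in `models`, a model for one gene may be retrained or replaced without affecting the models for other genes.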
In some embodiments, act 204 includes determining a tumor expression level for a set (e.g., at least some or all) of the genes listed in Table 1. For example, act 204 may include determining a tumor expression level for at least 10, at least 25, at least 30, at least 40, at least 50, at least 60, at least 75, at least 100, at least 150 or all of the genes listed in Table 1. Techniques for determining a tumor expression level for a gene are described herein, including at least with respect to
At act 206, the tumor expression levels of the genes associated with tumor cells are output. In some embodiments, the tumor expression levels are made accessible to a user (e.g., a clinician, a researcher, etc.). For example, the tumor expression levels may be displayed via a user interface (e.g., a graphical user interface (GUI)), stored locally in non-transitory storage medium, stored in a remote database or a cloud storage environment, and/or transmitted to one or more external computing devices.
In some embodiments, the tumor expression level of a particular gene is associated with one or more anti-cancer therapies. For example, a particular therapy may be known to effectively treat tumors expressing the particular gene. Additionally or alternatively, a particular therapy may be known to ineffectively treat tumors expressing the particular gene.
Accordingly, in some embodiments, at act 208 the output tumor expression levels are used to identify an anti-cancer therapy for administration to the subject. In some embodiments, this includes determining whether an output tumor expression level satisfies one or more criteria. In some embodiments, the criteria vary for each gene and its associated therapies. For example, a therapy may effectively treat tumors that express a particular gene (e.g., a tumor expression level of the gene that exceeds 0). By contrast, a therapy may effectively treat tumors that overexpress or under-express a gene (e.g., tumor expression levels that exceed or fall below an average expression of the gene).
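As a non-limiting illustration, the per-gene criteria check described above may be sketched in Python as follows. The class `TherapyRule`, the function `identify_therapies`, the therapy labels, the gene labels, and the threshold values are hypothetical names and values chosen only for this sketch; they are not taken from Table 3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TherapyRule:
    """Hypothetical pairing of a therapy with a gene-specific criterion."""
    therapy: str                          # illustrative therapy label
    gene: str                             # gene whose tumor expression level is tested
    criterion: Callable[[float], bool]    # returns True when the criterion is satisfied


def identify_therapies(tumor_levels: Dict[str, float],
                       rules: List[TherapyRule]) -> List[str]:
    """Return therapies whose criterion is satisfied by the output tumor expression levels."""
    selected = []
    for rule in rules:
        level = tumor_levels.get(rule.gene)
        if level is None:
            continue  # no tumor expression level was determined for this gene
        if rule.criterion(level):
            selected.append(rule.therapy)
    return selected


# Illustrative rules: one "expressed at all" criterion and one
# "exceeds an assumed average expression of 3.0" criterion.
rules = [
    TherapyRule("therapy_A", "GENE1", lambda level: level > 0.0),
    TherapyRule("therapy_B", "GENE2", lambda level: level > 3.0),
]
selected = identify_therapies({"GENE1": 2.0, "GENE2": 1.0}, rules)
```

In this sketch each criterion is a separate callable so that the criteria may vary for each gene and its associated therapies.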
Aspects of the disclosure relate to identification and/or selection of therapeutic agents (e.g., anti-cancer therapies) that are associated with a particular gene. A therapeutic agent that is “associated with a particular gene” refers to a therapeutic agent that interacts (e.g., binds to, inhibits activity or function, decreases activity or function, or alters activity or function) with a gene product (e.g., a nucleic acid such as DNA or RNA, a peptide, protein, etc.) expressed by the particular gene. For example, a therapeutic agent associated with a gene encoding a kinase (e.g., ALK) may bind to or interact with a nucleic acid (e.g., mRNA transcribed from the gene (e.g., ALK gene)) or a protein (e.g., ALK protein) expressed by the gene. In some embodiments, a therapeutic agent associated with a particular gene may interact directly with (e.g., bind to or directly inhibit) the particular gene. In some embodiments, a therapeutic agent associated with a particular gene may interact indirectly with the particular gene (e.g., bind to or inhibit a modulator of the particular gene). A therapeutic agent may be a small molecule (e.g., small molecule inhibitor, for example a kinase inhibitor, DNA methyltransferase inhibitor, topoisomerase inhibitor, etc.), nucleic acid (e.g., inhibitory nucleic acid such as dsRNA, siRNA, miRNA, etc., or a therapeutic mRNA), peptide, or protein (e.g., antibody, toxin, etc.). In some embodiments, the therapeutic agent is approved by a government regulatory agency (e.g., the US Food and Drug Administration) for treatment of cancer. FDA-approved agents are known in the art and are described, for example, in the FDA Orange Book or FDA Purple Book. Table 3 lists therapies associated with tumor expression of particular genes. In some embodiments, act 208 comprises identifying one or more therapies listed in Table 3.
In some embodiments, implementing process 200 may include additional or alternative steps that are not shown in
Process 220 begins at act 222, where a first set of features for a first gene associated with tumor cells is generated. In some embodiments, generating the first set of features includes adding, to the first set of features, at least some of the expression data obtained at act 202 of process 200. The included expression data may include, for example, total expression levels for at least some genes associated with tumor cells. Additionally or alternatively, the included expression data may include total expression levels for at least some genes associated with TME cells. Example techniques for including expression data in the first set of features are described herein including at least with respect to acts 254 and 256 of process 250, depicted in
In some embodiments, generating the first set of features for the first gene further includes determining an initial expression level estimate for the first gene in the tumor cells. For example, the initial expression level estimate of the first gene in the tumor cells may represent an estimate of the tumor expression level of the first gene in the tumor cells, prior to using a machine learning model to determine an updated tumor expression level of the first gene. In some embodiments, determining an initial expression level estimate for the first gene includes estimating the TME expression level of the first gene and subtracting the TME expression level estimate of the first gene from the total expression level of the first gene. Example techniques for determining an initial expression level estimate are described herein including at least with respect to act 252 of process 250, depicted in
In some embodiments, generating the first set of features for the first gene includes obtaining a first plurality of RNA percentages for a respective plurality of cell types in the biological sample and including the first plurality of RNA percentages in the first set of features. As referred to herein, in some embodiments, an “RNA percentage” for a particular cell type is indicative of the percent of RNA sequence reads (e.g., obtained using a sequencing platform) that have aligned to a particular gene (e.g., the first gene) and that originate from the particular cell type. For example, for the first gene, the RNA percentage for a first cell type is indicative of the percentage of RNA sequence reads that have aligned to the first gene and that originate from cells of the first cell type in the biological sample.
In some embodiments, obtaining the first plurality of RNA percentages for a respective plurality of cell types includes obtaining an RNA percentage for each of a plurality of TME cell types (e.g., neutrophils, fibroblasts, NK cells, etc.) in the biological sample. In some embodiments, obtaining the first plurality of RNA percentages includes obtaining an RNA percentage for tumor cells in the biological sample.
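By way of illustration only, the definition of an RNA percentage may be expressed as a short Python sketch. The sketch assumes that the number of reads aligned to a gene and originating from each cell type is already known, which is a simplifying assumption: in the embodiments described herein these quantities are estimated rather than observed directly. The function name `rna_percentages` and the cell type labels are hypothetical.

```python
from typing import Dict


def rna_percentages(reads_by_cell_type: Dict[str, int]) -> Dict[str, float]:
    """For one gene, convert per-cell-type aligned read counts into RNA percentages.

    reads_by_cell_type maps a cell type label to the (assumed known) number of
    reads aligned to the gene that originate from cells of that type.
    """
    total_reads = sum(reads_by_cell_type.values())
    if total_reads == 0:
        # No reads aligned to the gene: report zero for every cell type.
        return {cell_type: 0.0 for cell_type in reads_by_cell_type}
    return {
        cell_type: 100.0 * count / total_reads
        for cell_type, count in reads_by_cell_type.items()
    }


# Illustrative counts for a single gene.
percentages = rna_percentages({"tumor": 60, "fibroblast": 30, "nk_cell": 10})
```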
In some embodiments, RNA percentages are obtained using machine learning techniques. Example techniques for determining RNA percentages are described in the section “Cellular Deconvolution”. Some aspects of determining RNA percentages are also described in U.S. Patent Publication No. 2021-0287759, entitled “SYSTEMS AND METHODS FOR DECONVOLUTION OF EXPRESSION DATA”, the entire contents of which is herein incorporated by reference in its entirety.
At act 224, the first set of features is provided as input to a first machine learning model to obtain an output indicative of a TME expression level estimate for the first gene. In some embodiments, the TME expression level estimate is an estimated expression level of the first gene in the TME cells of the biological sample.
In some embodiments, the first machine learning model is of any suitable type. For example, in some embodiments, the first machine learning model may be a gradient boosted machine learning model. The gradient boosted machine learning model may be a gradient boosted decision tree model or may use any other suitable type of model as a “weak learner” boosted via gradient boosting or any other suitable boosting approach. In some embodiments, the gradient boosted ML model may be trained using a gradient boosting framework such as XGBoost, LightGBM, CatBoost, or AdaBoost.
It should be appreciated that the first machine learning model need not be a gradient boosted machine learning model and that other types of ML models may be used. For example, in some embodiments, a non-linear regression model (e.g., a logistic regression model), a neural network model, a support vector machine, a Gaussian mixture model, a random forest model, a decision tree model, or any other suitable type of machine learning model may be used, as aspects of the technology described herein are not limited in this respect.
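For illustration, the general idea of gradient boosting with a weak learner may be sketched in plain Python as follows. This is a toy regressor using one-split decision stumps and squared-error loss; it does not use XGBoost, LightGBM, or any other framework, and it is not intended to represent how the first machine learning model is implemented. All function names, the number of rounds, and the learning rate are assumptions made for the sketch.

```python
def fit_stump(X, residuals):
    """Fit a one-split regression stump to the residuals; returns (feature, threshold, left, right)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # a split must leave samples on both sides
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            err = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, t, left_mean, right_mean)
    if best is None:
        # No valid split exists: fall back to predicting the mean residual.
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_gbm(X, y, n_rounds=50, learning_rate=0.1):
    """Boost stumps on the residuals, i.e., the negative gradient of squared-error loss."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def predict_gbm(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


# Toy data: a single feature and a step-shaped target.
model = fit_gbm([[0.0], [1.0], [2.0], [3.0]], [0.0, 0.0, 10.0, 10.0])
```

Each round fits a weak learner to what the current ensemble still gets wrong and adds a damped version of it to the prediction, which is the mechanism the frameworks named above implement with more elaborate trees and regularization.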
In some embodiments, the machine learning model includes multiple parameters whose values may be estimated using training data. The process of estimating parameter values of parameters in an ML model using training data is referred to as “training” the ML model. In some embodiments, a machine learning model includes one or more hyperparameters in addition to the multiple parameters. Values of the hyperparameters may be estimated during training as well. Example techniques for training the first machine learning model are described herein including at least with respect to
At act 226, a first tumor expression level is determined for the first gene. In some embodiments, the first tumor expression level is the predicted expression level of the first gene in tumor cells of the biological sample.
In some embodiments, determining the first tumor expression level includes using the output of the first machine learning model and the total expression level of the first gene (e.g., obtained at act 202 of process 200). This may include, for example, subtracting the TME expression level estimate ($\mathrm{TME}_1$) for the first gene from the total expression level ($\mathrm{Total}_1$) of the first gene to obtain the (unscaled) first tumor expression level ($\mathrm{Tumor}_{\mathrm{unscaled},1}$), as shown in Equation 1.
$\mathrm{Tumor}_{\mathrm{unscaled},1} = \mathrm{Total}_1 - \mathrm{TME}_1$ (Equation 1)
In some embodiments, determining the tumor expression level for the first gene is further based on a predicted RNA percentage of the tumor cells in the biological sample. For example, the RNA percentage ($\mathrm{RP}_1$) of the tumor cells may be used to scale (e.g., divide) the difference between the total expression level and the TME expression level estimate to obtain the (scaled) first tumor expression level ($\mathrm{Tumor}_{\mathrm{scaled},1}$), as shown in Equation 2.
$\mathrm{Tumor}_{\mathrm{scaled},1} = (\mathrm{Total}_1 - \mathrm{TME}_1) / \mathrm{RP}_1$ (Equation 2)
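As a minimal sketch, the subtraction and optional scaling described above may be written in Python as follows. The function name `tumor_expression_level` is hypothetical, and the sketch assumes that the RNA percentage of the tumor cells is supplied as a fraction greater than 0 and at most 1, which is a convention chosen for the example rather than one stated in the disclosure.

```python
from typing import Optional


def tumor_expression_level(total: float,
                           tme_estimate: float,
                           tumor_rna_fraction: Optional[float] = None) -> float:
    """Subtract the TME estimate from the total; optionally divide by the tumor RNA fraction."""
    unscaled = total - tme_estimate
    if tumor_rna_fraction is None:
        return unscaled  # unscaled tumor expression level
    if tumor_rna_fraction <= 0.0:
        raise ValueError("tumor RNA fraction must be positive to scale by division")
    return unscaled / tumor_rna_fraction  # scaled tumor expression level


unscaled_level = tumor_expression_level(10.0, 4.0)
scaled_level = tumor_expression_level(10.0, 4.0, tumor_rna_fraction=0.5)
```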
At act 228, process 220 includes determining whether there is another gene associated with tumor cells for which a tumor expression level should be determined. When it is determined, at act 228, that there is another gene for which the tumor expression level is to be determined, acts 222-226 are repeated for the next gene. For example, for a second gene, this would include determining a second set of features, providing the second set of features as input to a second machine learning model to obtain an output indicative of a TME expression level estimate of the second gene in the TME cells, and determining a second tumor expression level for the second gene.
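The repetition of acts 222-226 for each gene may be illustrated with the following Python sketch. It assumes, for the purposes of the example only, that each per-gene model is available as a callable returning a TME expression level estimate and that a feature-building callable is supplied; the names `estimate_tumor_levels`, `build_features`, and `models` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def estimate_tumor_levels(
    genes: Sequence[str],
    total_levels: Dict[str, float],
    build_features: Callable[[str], List[float]],
    models: Dict[str, Callable[[List[float]], float]],
) -> Dict[str, float]:
    """Loop over tumor-associated genes, using a separate model for each gene."""
    tumor_levels = {}
    for gene in genes:                      # act 228: is there another gene?
        features = build_features(gene)     # act 222: generate the set of features
        tme_estimate = models[gene](features)   # act 224: model output (TME estimate)
        tumor_levels[gene] = total_levels[gene] - tme_estimate  # act 226
    return tumor_levels


# Illustrative stand-ins for trained models and feature generation.
totals = {"GENE1": 10.0, "GENE2": 8.0}
toy_models = {
    "GENE1": lambda features: 0.5 * features[0],
    "GENE2": lambda features: 0.25 * features[0],
}
levels = estimate_tumor_levels(
    ["GENE1", "GENE2"], totals, lambda gene: [totals[gene]], toy_models
)
```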
Process 250 begins at act 252, where an initial expression level estimate of the first gene in the tumor cells of the biological sample is obtained.
In some embodiments, the initial expression level estimate is obtained using the expression data obtained at act 202 of process 200. For example, the expression data may be used to obtain, for the first gene, RNA percentages for different TME cell populations (e.g., TME cells of a first type, TME cells of a second type, etc.) in the biological sample. Example techniques for determining RNA percentages are described herein including in the section “Cellular Deconvolution” and in U.S. Patent Publication No. 2021-0287759, entitled “SYSTEMS AND METHODS FOR DECONVOLUTION OF EXPRESSION DATA”, the entire contents of which is herein incorporated by reference in its entirety.
In some embodiments, the initial expression level estimate is further obtained using average expression levels of the first gene in each of various TME cell populations (e.g., the average expression levels of the first gene in TME cells of the first type, the average expression levels of the first gene in TME cells of the second type, the average expression levels of the first gene in TME cells of the Nth type, etc.). In some embodiments, the average expression level of a gene in a particular cell population is obtained by averaging the expression level of the gene in the cell population across different biological or artificial samples. For example, the average expression level of a gene in a TME cell population may be determined by computing the average expression level of the gene in the TME cell population in the training samples described with respect to
In some embodiments, the RNA percentages and average expression levels are used to determine a weighted sum that represents an initial expression level estimate of the first gene in TME cells of the biological sample. Equation 3 shows an example equation for determining an initial TME expression level estimate ($\mathrm{TME}_{\mathrm{initial},1}$) for the first gene in TME cells of a biological sample including K TME cell populations.
$\mathrm{TME}_{\mathrm{initial},1} = \sum_{k=1}^{K} \mathrm{RP}_k \cdot \mathrm{Exp}_k$ (Equation 3)
where $\mathrm{RP}_k$ represents the RNA percentage for the kth TME cell population and $\mathrm{Exp}_k$ represents the average TME expression level of the first gene in the kth TME cell population.
In some embodiments, the initial TME expression level estimate of the first gene is used to determine the initial tumor expression level estimate of the first gene in the tumor cells of the biological sample. For example, the initial TME expression level estimate of the first gene may be subtracted from the total expression level ($\mathrm{Total}_1$) of the first gene in the biological sample, obtained at act 202 of process 200. Equation 4 shows an example equation for determining an initial expression level estimate ($\mathrm{Tumor}_{\mathrm{initial},1}$) of the first gene in tumor cells of the biological sample.
$\mathrm{Tumor}_{\mathrm{initial},1} = \mathrm{Total}_1 - \mathrm{TME}_{\mathrm{initial},1}$ (Equation 4)
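The weighted sum over TME cell populations and the subsequent subtraction from the total expression level may be sketched in Python as follows. The function names and population labels are hypothetical, and the RNA percentages are assumed, for this example only, to be supplied as fractions between 0 and 1.

```python
from typing import Dict


def initial_tme_estimate(rna_fraction_by_population: Dict[str, float],
                         avg_expression_by_population: Dict[str, float]) -> float:
    """Weighted sum of average per-population expression, weighted by RNA fraction."""
    return sum(
        rna_fraction_by_population[population] * avg_expression_by_population[population]
        for population in rna_fraction_by_population
    )


def initial_tumor_estimate(total: float,
                           rna_fraction_by_population: Dict[str, float],
                           avg_expression_by_population: Dict[str, float]) -> float:
    """Total expression level minus the initial TME expression level estimate."""
    return total - initial_tme_estimate(
        rna_fraction_by_population, avg_expression_by_population
    )


# Illustrative values for one gene and two TME cell populations.
rna_fractions = {"type_A": 0.25, "type_B": 0.5}
avg_expression = {"type_A": 4.0, "type_B": 6.0}
tme_initial = initial_tme_estimate(rna_fractions, avg_expression)
tumor_initial = initial_tumor_estimate(10.0, rna_fractions, avg_expression)
```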
In some embodiments, the obtained initial expression level estimate of the first gene in the tumor cells is included in the first set of features at act 252 of process 250. For example, the initial expression level estimate may be provided as input to the first machine learning model at act 224 of process 220, along with other features included in the first set of features.
At act 254 of process 250, at least some of the total expression levels for genes associated with tumor cells are included in the first set of features. For example, the total expression levels include those obtained at act 202 of process 200.
In some embodiments, all the obtained total expression levels for the genes associated with tumor cells are included in the first set of features. In some embodiments, only a subset of the total expression levels is included in the first set of features. For example, in some embodiments, total expression levels for at least 10, at least 25, at least 30, at least 40, at least 50, at least 60, at least 75, at least 100, at least 150 or all of the genes listed in Table 1 are included in the first set of features.
In some embodiments, the subset that is included in the first set of features depends on the type of cancer that the subject has or is suspected of having. For example, Table 3 lists genes associated with different types of cancer. For a patient having or suspected of having a particular type of cancer, total expression levels for genes associated with tumor cells and associated with the type of cancer may be included in the first set of features.
In some embodiments, the subset of features to be included in the first set of features is identified as part of training the first machine learning model. Kursa et al. (Boruta—A System for Feature Selection, Fundamenta Informaticae, 2010; 101(4):271-285), incorporated by reference herein in its entirety, describes techniques for identifying features to be used as input to a machine learning model.
At act 256 of process 250, at least some of the total expression levels for genes associated with TME cells are included in the first set of features. For example, the total expression levels include those obtained at act 202 of process 200.
In some embodiments, all the obtained total expression levels for the genes associated with TME cells are included in the first set of features. In some embodiments, only a subset of the total expression levels is included in the first set of features. For example, in some embodiments, total expression levels for at least 10, at least 25, at least 30, at least 40, at least 50, at least 60, at least 75, at least 100, at least 150, at least 200, at least 250, at least 300, at least 350, at least 400 or all of the genes listed in Table 2 are included in the first set of features.
In some embodiments, the subset that is included in the first set of features depends on the type of cancer that the subject has or is suspected of having. For example, Table 3 lists genes associated with different types of cancer. For a patient having or suspected of having a particular type of cancer, total expression levels for genes associated with TME cells and associated with the type of cancer may be included in the first set of features.
In some embodiments, though not shown, generating the first set of features includes obtaining a first plurality of RNA percentages for cell types in the biological sample and including the first plurality of RNA percentages in the first set of features. For example, this may include obtaining a first RNA percentage for a TME cell of a first type and obtaining a second RNA percentage for a TME cell of a second type. Additionally or alternatively, this may include obtaining an RNA percentage for tumor cells in the biological sample.
In some embodiments, RNA percentages are obtained using machine learning techniques. Example techniques for determining RNA percentages are described in the section “Cellular Deconvolution”. Some aspects of determining RNA percentages are also described in U.S. Patent Publication No. 2021-0287759, entitled “SYSTEMS AND METHODS FOR DECONVOLUTION OF EXPRESSION DATA”, the entire contents of which is herein incorporated by reference in its entirety.
In some embodiments, features to be included in the first set of features are identified as part of training the first machine learning model. Kursa et al. (Boruta—A System for Feature Selection, Fundamenta Informaticae, 2010; 101(4):271-285), incorporated by reference herein in its entirety, describes techniques for identifying features to be used as input to a machine learning model.
It should be appreciated that process 250 may include, in some embodiments, one or more additional acts for including one or more additional features in the first set of features, as aspects of the technology described herein are not limited in this respect. For example, generating the first set of features using process 250 may include obtaining and/or including one or more additional features to be included in the first set of features.
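For illustration, the assembly of a set of features from the initial expression level estimate, the total expression levels for tumor genes and TME genes, and optional RNA percentages may be sketched in Python as follows. The function name `build_feature_vector`, the ordering of the features, and the gene and cell type labels are assumptions made for this sketch.

```python
from typing import Dict, List, Optional, Sequence


def build_feature_vector(
    initial_estimate: float,
    total_levels: Dict[str, float],
    tumor_genes: Sequence[str],
    tme_genes: Sequence[str],
    rna_percentages: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Concatenate the features for one gene into a flat list."""
    features = [initial_estimate]                               # initial expression level estimate
    features.extend(total_levels[gene] for gene in tumor_genes)  # tumor gene totals
    features.extend(total_levels[gene] for gene in tme_genes)    # TME gene totals
    if rna_percentages is not None:
        # Optional additional features; sorted keys give a stable ordering.
        features.extend(rna_percentages[cell_type] for cell_type in sorted(rna_percentages))
    return features


totals = {"TUMOR_GENE1": 10.0, "TUMOR_GENE2": 8.0, "TME_GENE1": 3.0}
feature_vector = build_feature_vector(
    6.0, totals, ["TUMOR_GENE1", "TUMOR_GENE2"], ["TME_GENE1"],
    rna_percentages={"tumor": 60.0, "fibroblast": 40.0},
)
```

A fixed feature ordering matters in practice because the same ordering would need to be used when a model is trained and when it is later applied.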
As shown in
In some embodiments, the biological sample 301 is processed or may have been previously processed to obtain expression data 303. For example, the expression data may be generated using a sequencing platform (e.g., sequencing platform 102 shown in
In some embodiments, the expression data 303 includes expression data for genes associated with tumor cells (also referred to herein as “tumor genes”) and genes associated with TME cells (also referred to herein as “TME genes”). In some embodiments, the tumor genes include a number of genes M and the TME genes include a number of genes N, which may be the same as or different from M. For example, the tumor genes may include M genes listed in Table 1 and the TME genes may include N genes listed in Table 2. Additionally or alternatively, the M tumor genes may include at least 10 genes, at least 25 genes, at least 35 genes, at least 50 genes, at least 75 genes, at least 100 genes, at least 120 genes, between 10 and 130 genes, between 25 and 100 genes, between 50 and 100 genes, etc. The N TME genes may include at least 10 genes, at least 25 genes, at least 35 genes, at least 50 genes, at least 75 genes, at least 100 genes, at least 150 genes, at least 175 genes, at least 200 genes, at least 250 genes, at least 300 genes, at least 350 genes, at least 400 genes, at least 450 genes, between 10 and 475 genes, between 25 and 400 genes, between 50 and 350 genes, between 100 and 300 genes, etc.
In some embodiments, the expression data 303 includes the total expression level for each of the listed tumor genes and each of the listed TME genes. For example, the expression data 303 includes the total expression level for a first gene associated with tumor cells and the total expression level for a first gene associated with TME cells.
In some embodiments, the expression data 303 is used to generate a set of features for each of the genes associated with tumor cells. For example, the expression data 303 is used to generate a first set of features 304a for the first tumor gene, a second set of features 304b for the second tumor gene, and an Mth set of features 304c for the Mth tumor gene. In some embodiments, all of the expression data 303 is used to generate a set of features for a gene. Additionally or alternatively, only a subset of the expression data (e.g., only a subset of the total expression levels of the tumor genes and/or TME genes) is used to generate a set of features for a gene. Example techniques for generating a set of features for a gene are described herein including at least with respect to
In some embodiments, each set of features is provided as input to a respective machine learning model to obtain a corresponding output. For example, the first set of features 304a is provided as input to a first machine learning model 306a to obtain an output 308a indicative of the TME expression level estimate of the first gene in TME cells 301b of the biological sample 301. The second set of features 304b is provided as input to a second machine learning model 306b to obtain an output 308b indicative of the TME expression level estimate of the second gene in TME cells 301b of the biological sample. The Mth set of features is provided as input to an Mth machine learning model 306c to obtain an output 308c indicative of the TME expression level estimate of the Mth gene in TME cells 301b of the biological sample. Example techniques for using a machine learning model to obtain an output indicative of a TME expression level estimate of a gene are described herein including at least with respect to act 224 of process 220 shown in
In some embodiments, the output of each machine learning model is used to determine a tumor expression level estimate of the gene. For example, the output 308a of the first machine learning model 306a is used to determine the tumor expression level 310a for the first gene in the tumor cells 301a of the biological sample 301. The output 308b of the second machine learning model 306b is used to determine the tumor expression level 310b for the second gene in the tumor cells 301a of the biological sample 301. The output 308c of the Mth machine learning model 306c is used to determine the tumor expression level 310c for the Mth gene in the tumor cells 301a of the biological sample 301. Example techniques for using the output of a machine learning model to determine the tumor expression level of a gene are described herein including at least with respect to act 226 of process 220 shown in
As shown in
In some embodiments, the first set of features 304a includes any suitable features for the first gene including, for example, an initial expression level estimate 352a for the first gene, at least some of the total expression levels 354a for the tumor genes, at least some of the total expression levels 356a for the TME genes, and/or a first plurality of RNA percentages 358a. It should be appreciated that the first set of features 304a may include additional or fewer features than those shown in
In some embodiments, the initial expression level estimate 352a may be based on (a) the total expression level for the first gene in the biological sample, (b) RNA percentages for the TME cell populations 301b (e.g., RNA percentages for TME cell populations of Type A 322, Type B 324, and Type C 326), and (c) average expression levels of the first gene in each of the TME cell populations. Example techniques for determining an initial expression level estimate are described herein including at least with respect to act 252 of process 250, shown in
In some embodiments, the total expression levels 354a for the tumor genes include all or a subset of the total expression levels included in the expression data 303 for genes 1-M. For example, the subset of the total expression levels may be selected based on a type of cancer that the subject has or is suspected of having. Example techniques for identifying the total expression levels for tumor genes to be included in a set of features are described herein including at least with respect to act 254 of process 250, shown in
In some embodiments, the total expression levels 356a for the TME genes include all or a subset of the total expression levels included in the expression data 303 for genes 1-N. For example, the subset of the total expression levels may be selected based on a type of cancer that the subject has or is suspected of having. Example techniques for identifying the total expression levels for TME genes to be included in a set of features are described herein including at least with respect to act 256 of process 250, shown in
In some embodiments, the first plurality of RNA percentages 358a include RNA percentages for each of multiple cell types in the biological sample. In some embodiments, each of the first plurality of RNA percentages 358a is indicative of the percent of RNA sequence reads that have aligned to the first gene that originate from a particular cell type in the biological sample. For example, the first plurality of RNA percentages may include a first RNA percentage indicative of the percentage of RNA sequence reads that have aligned to the first gene that originate from the first cell type. The first plurality of RNA percentages 358a may include RNA percentages for one or more TME populations of different cell types and/or an RNA percentage for tumor cells in the biological sample.
In some embodiments, the second set of features 304b includes any suitable features for the second gene including, for example, an initial expression level estimate 352b for the second gene, at least some of the total expression levels 354b for the tumor genes, at least some of the total expression levels 356b for the TME genes, and/or a second plurality of RNA percentages 358b. It should be appreciated that the second set of features 304b may include additional or fewer features than those shown in
In some embodiments, the initial expression level estimate 352b may be based on (a) the total expression level for the second gene in the biological sample, (b) RNA percentages for the TME cell populations 301b (e.g., RNA percentages for TME cell populations of Type A 322, Type B 324, and Type C 326), and (c) average expression levels of the second gene in each of the TME cell populations. Example techniques for determining an initial expression level estimate are described herein including at least with respect to act 252 of process 250, shown in
In some embodiments, the total expression levels 354b for the tumor genes include all or a subset of the total expression levels included in the expression data 303 for genes 1-M. For example, the subset of the total expression levels may be selected based on a type of cancer that the subject has or is suspected of having. Example techniques for identifying the total expression levels for tumor genes to be included in a set of features are described herein including at least with respect to act 254 of process 250, shown in
In some embodiments, the total expression levels 356b for the TME genes include all or a subset of the total expression levels included in the expression data 303 for genes 1-N. For example, the subset of the total expression levels may be selected based on a type of cancer that the subject has or is suspected of having. Example techniques for identifying the total expression levels for TME genes to be included in a set of features are described herein including at least with respect to act 256 of process 250, shown in
In some embodiments, the second plurality of RNA percentages 358b include RNA percentages for each of multiple cell types in the biological sample. In some embodiments, each of the second plurality of RNA percentages 358b is indicative of the percent of RNA sequence reads that have aligned to the second gene that originate from a particular cell type in the biological sample. For example, the second plurality of RNA percentages may include a first RNA percentage indicative of the percentage of RNA sequence reads that have aligned to the second gene that originate from the first cell type. The second plurality of RNA percentages 358b may include RNA percentages for one or more TME populations of different cell types and/or an RNA percentage for tumor cells in the biological sample.
In some embodiments, the Mth set of features 304c includes any suitable features for the Mth gene including, for example, an initial expression level estimate 352c for the Mth gene, at least some of the total expression levels 354c for the tumor genes, at least some of the total expression levels 356c for the TME genes, and/or an Mth plurality of RNA percentages 358c. It should be appreciated that the Mth set of features 304c may include additional or fewer features than those shown in
In some embodiments, the initial expression level estimate 352c may be based on (a) the total expression level for the Mth gene in the biological sample, (b) RNA percentages for the TME cell populations 301b (e.g., RNA percentages for TME cell populations of Type A 322, Type B 324, and Type C 326), and (c) average expression levels of the Mth gene in each of the TME cell populations. Example techniques for determining an initial expression level estimate are described herein including at least with respect to act 252 of process 250, shown in
In some embodiments, the total expression levels 354c for the tumor genes include all or a subset of the total expression levels included in the expression data 303 for genes 1-M. For example, the subset of the total expression levels may be selected based on a type of cancer that the subject has or is suspected of having. Example techniques for identifying the total expression levels for tumor genes to be included in a set of features are described herein including at least with respect to act 254 of process 250, shown in
In some embodiments, the total expression levels 356c for the TME genes include all or a subset of the total expression levels included in the expression data 303 for genes 1-N. For example, the subset of the total expression levels may be selected based on a type of cancer that the subject has or is suspected of having. Example techniques for identifying the total expression levels for TME genes to be included in a set of features are described herein including at least with respect to act 256 of process 250, shown in
In some embodiments, the Mth plurality of RNA percentages 358c include RNA percentages for each of multiple cell types in the biological sample. In some embodiments, each of the Mth plurality of RNA percentages 358c is indicative of the percent of RNA sequence reads that have aligned to the Mth gene that originate from a particular cell type in the biological sample. For example, the Mth plurality of RNA percentages may include a first RNA percentage indicative of the percentage of RNA sequence reads that have aligned to the Mth gene that originate from the first cell type. The Mth plurality of RNA percentages 358c may include RNA percentages for one or more TME populations of different cell types and/or an RNA percentage for tumor cells in the biological sample.
In some embodiments, computing device 404 includes software 410 configured to perform various functions with respect to the expression data (e.g., expression data 103 shown in
For example, as shown in
In some embodiments, the feature generation module 460 obtains expression data from the expression data store 446 and/or the sequencing platform 444.
In some embodiments, the feature generation module 460 generates sets of features for respective genes of a set of genes associated with tumor cells (e.g., genes listed in Table 1). For example, the feature generation module 460 may generate a first set of features for a first gene listed in Table 1.
In some embodiments, a set of features generated by the feature generation module 460 includes at least some of the obtained expression data and an initial expression level estimate of a gene in tumor cells of a biological sample. However, it should be appreciated that other information may be included in the set of features.
In some embodiments, the expression data included in the set of features includes total expression levels for genes associated with tumor cells in a biological sample and total expression levels for genes associated with TME cells in the biological sample. For example, the set of features may include a first total expression level for a first gene associated with tumor cells (e.g., genes listed in Table 1) and/or a second total expression level for a second gene associated with TME cells (e.g., genes listed in Table 2).
In some embodiments, the initial expression level estimate of a gene is determined using the feature generation module 460. In some embodiments, determining the initial expression level estimate for a gene includes obtaining average expression levels for the gene in multiple TME cell populations and obtaining RNA percentages for the multiple TME cell populations in the biological sample. For example, the average expression levels may be obtained from the expression data store 446 via the data store interface module 442 and the RNA percentages may be obtained from the cell composition determination module 464. In some embodiments, the feature generation module 460 determines an initial expression level estimate for a gene based on the average expression levels of a gene, the corresponding RNA percentages, and the total expression level of the gene in the biological sample. Techniques for determining an initial expression level estimate are described herein including at least with respect to
In some embodiments, cell composition determination module 464 obtains expression data from sequencing platform 444 and/or expression data store 446. In some embodiments, the obtained expression data includes total expression levels for genes associated with tumor and TME cells in a biological sample.
In some embodiments, the cell composition determination module 464 processes the obtained expression data to determine one or more RNA percentages for a biological sample. For example, the cell composition determination module 464 may process the expression data to determine RNA percentages for tumor cells in a biological sample. Additionally or alternatively, the cell composition determination module 464 may process the expression data to determine RNA percentages for TME cells of different types in the biological sample. As nonlimiting examples, the cell composition determination module 464 may determine, for a particular gene, an RNA percentage for neutrophils in the TME and an RNA percentage for B cells in the TME. Techniques for determining RNA percentages are described herein including at least with respect to
In some embodiments, the expression level determination module 462 obtains sets of features from the feature generation module 460, obtains machine learning models from the machine learning model data store 454, and obtains RNA percentages from the cell composition determination module 464.
In some embodiments, the obtained machine learning models include a machine learning model for each of multiple genes associated with tumor cells (e.g., genes listed in Table 1). For example, the machine learning models may include a first machine learning model for a first gene listed in Table 1. In some embodiments, the machine learning models may each be trained to estimate a TME expression level of a gene in TME cells of a biological sample. For example, the first machine learning model may be trained to estimate the TME expression of the first gene in TME cells of the biological sample.
In some embodiments, the obtained RNA percentages include an RNA percentage for tumor cells in the biological sample. In some embodiments, the RNA percentage indicates a percent of RNA sequence reads that have aligned to a particular gene that originate from tumor cells in the biological sample.
In some embodiments, the expression level determination module 462 processes the obtained features using the machine learning models to determine estimated TME expression levels of genes in TME cells of a biological sample. For example, the expression level determination module 462 may process a first set of features generated for a first gene using a first machine learning model to obtain an output indicative of an estimated TME expression level of the first gene in TME cells of the biological sample. In some embodiments, the expression level determination module 462 may use a different machine learning model to process each set of features (e.g., corresponding to different genes associated with tumor cells).
In some embodiments, the expression level determination module 462 determines tumor expression levels for genes associated with tumor cells based on the outputs of the machine learning models, the obtained RNA percentage for tumor cells in the biological sample, and total expression levels for the genes in the biological sample. For example, the expression level determination module 462 may determine a first tumor expression level for a first gene based on an output of a first machine learning model, the RNA percentage for the tumor cells, and the total expression level of the first gene in the biological sample. Techniques for determining tumor expression levels are described herein including at least with respect to
In some embodiments, the feature generation module 460 and the cell composition determination module 464 obtain the expression data and/or average expression levels via one or more interface modules. In some embodiments, the interface modules include sequencing platform interface module 448 and data store interface module 442. The sequencing platform interface module 448 may be configured to obtain (either pull or be provided) expression data from the sequencing platform 444. The data store interface module 442 may be configured to obtain (either pull or be provided) expression data and/or the average expression levels from the expression data store 446. The data may be provided via a communication network (not shown), such as the Internet or any other suitable network, as aspects of the technology described herein are not limited to any particular communication network.
In some embodiments, the expression data store 446 includes any suitable data store, such as a flat file, a data store, a multi-file, or data storage of any suitable type, as aspects of the technology described herein are not limited to any particular type of data store. The expression data store 446 may be part of software 410 (not shown) or excluded from software 410, as shown in
In some embodiments, expression data store 446 stores expression data obtained from biological sample(s) of one or more subjects. In some embodiments, the expression data may be obtained from sequencing platform 444 and/or from one or more public data stores and/or studies. In some embodiments, a portion of the expression data may be processed by the feature generation module 460 to generate sets of features to be provided as input to machine learning models. In some embodiments, a portion of the expression data may be processed by the cell composition determination module 464 to determine RNA percentages for cell populations in a biological sample. In some embodiments, a portion of the expression data may be processed by the expression level determination module 462 to determine tumor expression levels of genes in tumor cells of a biological sample. In some embodiments, a portion of the expression data may be used to train one or more machine learning models (e.g., with the machine learning model training module 452).
In some embodiments, the expression level determination module 462 obtains the machine learning models via the data store interface module 442. The data store interface module 442 may be configured to obtain (either pull or be provided) machine learning models from the machine learning model data store 454. The machine learning models may be provided via a communication network (not shown), such as the Internet or any other suitable network, as aspects of the technology described herein are not limited to any particular communication network.
In some embodiments, machine learning model data store 454 includes any suitable data store, such as a flat file, a data store, a multi-file, or data storage of any suitable type, as aspects of the technology described herein are not limited to any particular type of data store. The machine learning model data store 454 may be part of software 410 (not shown) or excluded from software 410, as shown in
In some embodiments, the machine learning model data store 454 stores a plurality of machine learning models used to determine TME expression level estimates for genes in TME cells of a biological sample. In some embodiments, each machine learning model corresponds to a gene of a set of genes associated with tumor cells (e.g., genes listed in Table 1).
In some embodiments, machine learning model training module 452, referred to herein as training module 452, is configured to train the one or more machine learning models used to estimate TME expression levels for genes in TME cells of the biological sample. This may include training a first machine learning model to estimate a TME expression level for a first gene in TME cells of a biological sample. In some embodiments, the training module 452 trains a machine learning model using a training set of expression data. For example, the training module 452 may obtain training data via data store interface module 442. In some embodiments, the training module 452 may provide trained machine learning models to the machine learning model data store 454 via data store interface module 442. Techniques for training machine learning models are described herein including at least with respect to
In some embodiments, the determined tumor expression levels may be output from the expression level determination module 462. For example, the tumor expression level estimates may be output to a user 456 via user interface 458. Additionally or alternatively, the determined tumor expression levels may be stored in memory.
User interface 458 may be a graphical user interface (GUI), a text-based user interface, and/or any other suitable type of interface through which a user may provide input. For example, in some embodiments, the user interface may be a webpage or web application accessible through an Internet browser. In some embodiments, the user interface may be a graphical user interface (GUI) of an app executing on the user's mobile device. In some embodiments, the user interface may include a number of selectable elements through which a user may interact. For example, the user interface may include dropdown lists, checkboxes, text fields, or any other suitable element.
As shown in
In some embodiments, the expression data 502 is used to obtain, for different genes (e.g., genes 1-M), RNA percentages 506 for different cell populations in the biological sample. In some embodiments, the expression data 502 is processed using one or more machine learning models 504 to obtain the RNA percentages 506. For example, the expression data 502 may be processed using the techniques described herein including at least with respect to
In some embodiments, the RNA percentages 506 include RNA percentages for tumor cells and for TME cells of different types. For example, the RNA percentages include an RNA percentage for TME cells of Type A, an RNA percentage for TME cells of Type B, and an RNA percentage of TME cells of Type C. It should be appreciated that this is meant to be an illustrative example, and any suitable number of RNA percentages corresponding to any suitable number of cell populations in the biological sample may be included in RNA percentages 506.
The average expression levels 508 include the average expression levels of genes associated with tumor cells (e.g., genes 1-M) in each of multiple different cell types (e.g., TME cell types). For example, the average expression levels 508 may include average expression levels for genes 1-M in TME cells of Type A, TME cells of Type B, and TME cells of Type C. In some embodiments, as described herein including at least with respect to
In some embodiments, the average expression levels 508 and the RNA percentages 506 are used to generate an initial expression level estimate 510 of the first gene in TME cells of the biological sample. For example, in some embodiments, this may include determining a weighted sum using the average expression levels 508 for the first gene in the different TME cell populations (e.g., Type A, Type B, and Type C) and the corresponding RNA percentages for those cell populations. For example, determining the initial expression level estimate 510 of the first gene in the TME cells may include using Equation 3.
In some embodiments, the expression data 502 and the initial expression level estimate 510 of the first gene in the TME cells are used to determine the initial expression level estimate 512 of the first gene in the tumor cells of the biological sample. For example, in some embodiments, the initial expression level estimate 510 of the first gene in the TME cells of the biological sample is subtracted from the total expression level 502a of the first gene in the biological sample. For example, determining the initial expression level estimate 512 of the first gene in the tumor cells may include using Equation 4.
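As a nonlimiting illustration, the weighted sum and the subtraction described above may be sketched as follows. Equations 3 and 4 are not reproduced in this passage, so the sketch follows only the prose description; the function names, the cell type labels, the numeric values, and the choice to express RNA percentages as fractions between 0 and 1 are assumptions made for illustration.

```python
def initial_tme_estimate(avg_expression, rna_fraction):
    """Weighted sum: average expression per TME cell type times its RNA fraction."""
    return sum(avg_expression[cell] * rna_fraction[cell] for cell in avg_expression)


def initial_tumor_estimate(total_expression, tme_estimate):
    """Total expression of the gene minus the initial TME estimate."""
    return total_expression - tme_estimate


# Hypothetical values for a single gene (not taken from the source).
avg_expression = {"type_a": 20.0, "type_b": 10.0, "type_c": 40.0}
rna_fraction = {"type_a": 0.10, "type_b": 0.05, "type_c": 0.05}
total_expression = 50.0

tme_0 = initial_tme_estimate(avg_expression, rna_fraction)
tumor_0 = initial_tumor_estimate(total_expression, tme_0)
print(tme_0, tumor_0)
```

In this sketch the initial tumor estimate is only a starting point; it becomes one entry in the set of features rather than the final tumor expression level.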
In some embodiments, the initial expression level estimate 512 of the first gene in the tumor cells and at least some of the expression data 502 are included in the first set of features 516. For example, at least a subset (e.g., some or all) of the total expression levels for the genes associated with tumor cells (e.g., total expression level 502a) and at least a subset of the total expression levels for the genes associated with TME cells are included in the first set of features 516.
Additionally or alternatively, the RNA percentages 506 are included in the first set of features 516. For example, at least a subset (e.g., some or all) of the RNA percentages 506 are included in the first set of features 516.
In some embodiments, the first set of features 516 is provided as input to the first machine learning model 518 to obtain an output 520 indicative of the TME expression level estimate of the first gene in TME cells of the biological sample.
In some embodiments, the output 520, at least some of the expression data 502, and one or more of the RNA percentages 506 are used to determine the tumor expression level of the first gene in the tumor cells of the biological sample. For example, the TME expression level estimate may be subtracted from the total expression level 502a of the first gene in the biological sample. The difference may, in some embodiments, be divided by the RNA percentage of tumor cells in the biological sample to obtain the tumor expression level 522. For example, determining the tumor expression level 522 for the first gene may include using Equations 1 and 2.
As shown in
In some embodiments, the expression data 552 is used to obtain the RNA percentages 556 for different cell populations in the biological sample. In some embodiments, this includes processing the expression data using a machine learning model to obtain the RNA percentages 556, as described herein including at least with respect to
In some embodiments, the RNA percentages 556 include an RNA percentage for the tumor cells and for TME cell populations in the biological sample. For the purpose of this example, the biological sample includes tumor cells and TME cells including neutrophils, NK cells, and fibroblasts. The RNA percentages 556 are indicative of a percent of RNA sequence reads aligned to the respective gene (e.g., XRCC1, AREG, CDH1, etc.) that originated from a respective cell population (e.g., neutrophils, NK cells, fibroblasts, tumor cells, etc.). In this example, for the XRCC1 gene, 6% of the RNA sequence reads that aligned to the XRCC1 gene originated from neutrophils, 4% originated from NK cells, 10% originated from fibroblasts, and 80% originated from tumor cells.
In some embodiments, average expression levels 558 are obtained for each gene associated with tumor cells in different cell populations in the biological sample. For example, for the XRCC1 gene, the average expression levels 558 include an average expression level of the XRCC1 gene in each of the TME cell populations (e.g., the neutrophils, NK cells, and fibroblasts) in the biological sample.
In some embodiments, the RNA percentages 556 and the average expression levels 558 are used to determine an initial TME expression level estimate 560 of XRCC1. As shown in
In some embodiments, at least some of the expression data 552 and the initial TME expression level estimate 560 of the XRCC1 gene are used to determine the initial tumor expression level estimate 562 of the XRCC1 gene. For example, as shown, the initial TME expression level estimate 560 of the XRCC1 gene (5.38) may be subtracted from the total expression level of the XRCC1 gene (81.7) in the biological sample to obtain the initial tumor expression level estimate 562 of the XRCC1 gene (72.8).
In some embodiments, at least some of the expression data 552, at least some of the RNA percentages 556, and the initial tumor expression level estimate 562 are included in the set of features 566 for the XRCC1 gene. For example, the expression data 552 included in the set of features 566 may include all of the total expression levels for the tumor genes and/or all of the total expression levels for the TME genes. Additionally or alternatively, the expression data 552 included in the set of features 566 may include only a subset of the total expression levels for the tumor genes (e.g., including the total expression level for the XRCC1 gene) and/or only a subset of the total expression levels for the TME genes.
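One way to picture the assembly of the set of features 566 is the sketch below. The helper name `build_feature_set`, the feature naming scheme, the placeholder TME gene names, and the total expression values for AREG and CDH1 are hypothetical; only the XRCC1 total (81.7), the initial tumor estimate (72.8), and the RNA percentages are taken from the example above.

```python
def build_feature_set(gene, initial_tumor_estimate, tumor_totals, tme_totals,
                      rna_fractions=None, tumor_subset=None, tme_subset=None):
    """Return (names, values) for one gene's feature set.

    When a subset is None, all total expression levels of that group are used.
    """
    tumor_genes = sorted(tumor_totals) if tumor_subset is None else list(tumor_subset)
    tme_genes = sorted(tme_totals) if tme_subset is None else list(tme_subset)

    names = [f"initial_tumor_estimate:{gene}"]
    values = [initial_tumor_estimate]
    for g in tumor_genes:
        names.append(f"tumor_total:{g}")
        values.append(tumor_totals[g])
    for g in tme_genes:
        names.append(f"tme_total:{g}")
        values.append(tme_totals[g])
    # RNA percentages are optional features, expressed here as fractions.
    for cell in sorted(rna_fractions or {}):
        names.append(f"rna_fraction:{cell}")
        values.append(rna_fractions[cell])
    return names, values


tumor_totals = {"XRCC1": 81.7, "AREG": 12.0, "CDH1": 30.0}  # AREG and CDH1 values are hypothetical
tme_totals = {"TME_GENE_1": 4.0, "TME_GENE_2": 9.5}         # placeholder TME genes
rna_fractions = {"neutrophils": 0.06, "nk_cells": 0.04, "fibroblasts": 0.10, "tumor": 0.80}

names, values = build_feature_set(
    "XRCC1", 72.8, tumor_totals, tme_totals,
    rna_fractions=rna_fractions, tumor_subset=["XRCC1", "CDH1"],
)
```

Keeping the names alongside the values makes it straightforward to confirm that the same feature order is used at training time and at inference time.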
In some embodiments, the set of features 566 is provided as input to a machine learning model 568 for the XRCC1 gene to obtain an output 570 indicative of the TME expression level estimate of XRCC1 in the TME cells of the biological sample. For example, the TME expression level estimate may indicate an estimated expression of XRCC1 in the TME cells of the biological sample.
In some embodiments, the output 570, expression data 552, and RNA percentages 556 are used to determine the tumor expression level 572 of the XRCC1 gene in tumor cells of the biological sample. In some embodiments, as shown, determining the tumor expression level 572 includes subtracting the TME expression level estimate of the XRCC1 gene from the total expression level of the XRCC1 gene in the biological sample (81.7) and dividing the difference by the RNA percentage of tumor cells (0.80) in the biological sample. For example, as shown, the TME expression level of the XRCC1 gene is subtracted from 81.7 and divided by 0.80 to obtain the tumor expression level of the XRCC1 gene.
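The subtraction and division described above may be sketched as follows, as a nonlimiting illustration. The total expression level (81.7) and the tumor RNA percentage (0.80) come from the example; the TME expression level estimate of 5.0 is a hypothetical stand-in for the model output 570, and the function name is an assumption.

```python
def tumor_expression_level(total_expression, tme_estimate, tumor_rna_fraction):
    """(total expression - TME estimate) / RNA fraction of tumor cells."""
    if not 0.0 < tumor_rna_fraction <= 1.0:
        raise ValueError("tumor_rna_fraction must be in (0, 1]")
    return (total_expression - tme_estimate) / tumor_rna_fraction


tme_estimate = 5.0  # hypothetical model output, not a value from the source
xrcc1_tumor_level = tumor_expression_level(81.7, tme_estimate, 0.80)
print(round(xrcc1_tumor_level, 3))
```

The guard on the tumor RNA fraction is a design choice of this sketch: a fraction of zero would make the division undefined, so such samples would need separate handling.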
Machine Learning Model Training
Process 600 may be performed by any suitable computing device(s). For example, process 600 may be performed by a laptop computer, a desktop computer, one or more servers, in a cloud computing environment, computing device 2400 as described herein with respect to
Process 600 begins at act 602 where training data is obtained. In some embodiments, the training data includes simulated expression data associated with one or more training samples (e.g., biological samples). In some embodiments, the simulated expression data may include expression data that is generated partially in silico. For example, the simulated expression data may include data that was obtained by sampling reads from multiple expression data sets from purified cell type samples. In some embodiments, the simulated expression data may comprise expression data measured in TPM.
In some embodiments, the training data includes simulated expression data for genes associated with tumor cells and simulated expression data for genes associated with TME cells. For example, genes associated with tumor cells may include genes listed in Table 1 and the genes associated with TME cells may include genes listed in Table 2. In some embodiments, the simulated expression data for the genes associated with tumor cells includes total expression levels for the genes in the training sample(s). For example, the simulated expression data may include a first total expression level for a first gene associated with tumor cells. In some embodiments, the simulated expression data for the genes associated with TME cells includes total expression levels for genes in the training sample(s). For example, the simulated expression data may include a second total expression level for a second gene associated with TME cells.
In some embodiments, the training data may be generated as part of act 602. As described herein including at least with respect to
The training data may be obtained in any suitable manner at act 602. For example, the training data may be stored on at least one storage medium (e.g., in one or more files, or in a database). In some embodiments, the at least one storage medium storing the training data may be local to the computing device (e.g., stored on the same at least one non-transitory storage medium), or may be external to the computing device (e.g., stored in a remote database or a cloud storage environment). The training data may be stored on a single storage medium, or may be distributed across multiple storage mediums.
In some embodiments, act 602 may further comprise pre-processing the training data in any suitable manner. For example, the training data may be sorted, combined, organized into batches, filtered, or pre-processed with any other suitable techniques. The pre-processing may make the training data suitable to be processed using the one or more machine learning models, for example. In some embodiments, the training data may be split into separate training, validation, and holdout datasets.
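A minimal sketch of splitting the training data into training, validation, and holdout datasets is shown below. The split fractions, the fixed random seed, and the function name `split_dataset` are assumptions for illustration; the source does not specify how the split is performed.

```python
import random


def split_dataset(samples, val_fraction=0.1, holdout_fraction=0.1, seed=0):
    """Shuffle the samples and split them into (training, validation, holdout)."""
    if val_fraction < 0 or holdout_fraction < 0 or val_fraction + holdout_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_val = int(len(shuffled) * val_fraction)
    n_holdout = int(len(shuffled) * holdout_fraction)
    validation = shuffled[:n_val]
    holdout = shuffled[n_val:n_val + n_holdout]
    training = shuffled[n_val + n_holdout:]
    return training, validation, holdout


training, validation, holdout = split_dataset(range(100))
```

When the training samples are artificial mixes built from shared purified samples, splitting at the level of the underlying samples rather than the mixes may help avoid leakage between the datasets; that refinement is not shown here.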
At act 604, a training set of features is generated using the training data. In some embodiments, generating the training set of features includes obtaining an initial expression level estimate of the gene in the tumor cells of the training sample(s). The initial expression level estimate may be included in the training set of features. In some embodiments, generating the training set of features includes including, in the training set of features, at least some of the total expression levels for genes associated with tumor cells and at least some of the total expression levels for genes associated with TME cells. For example, the total expression levels may include the total expression levels obtained at act 602. In some embodiments, generating the training set of features includes including, in the training set of features, RNA percentages obtained for the biological sample. Techniques for generating features are further described herein including at least with respect to
At act 606, a first machine learning model is trained to estimate a TME expression level of a first gene in TME cells of the training sample(s). In some embodiments, at sub-act 606a, the training set of features may be provided as input to a first machine learning model (e.g., the first machine learning model described herein including with respect to
At sub-act 606b, training the first machine learning model may proceed with updating parameters using the estimate of the TME expression level output at sub-act 606a. In some embodiments, the estimate of the TME expression level may be compared to a known value for the TME expression level of the first gene in the TME cells as part of sub-act 606b. For example, a loss function may be applied to the estimated value and the known value in order to determine a loss associated with the estimated value. In some embodiments, the loss may be used to update the parameters of the model. For example, gradient descent, or any other suitable optimization technique, may be applied in order to update the parameters of the model so as to minimize the loss.
The first machine learning model may process its input using any suitable techniques, as described herein. In some embodiments, the first model may use a gradient boosting machine learning technique. For example, the first model may comprise an ensemble of weak prediction models, such as decision trees, or any other suitable prediction models, which may be combined in an iterative fashion using a gradient boosting algorithm. In some embodiments, a gradient boosting framework such as XGBoost, LightGBM, CatBoost, or AdaBoost may be used as part of training the first model.
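To make the idea of iteratively combining weak prediction models concrete, the following is a toy gradient boosting regressor built from one-split decision stumps on a single feature, written in plain Python. It is an illustrative sketch only and does not represent XGBoost, LightGBM, or any other framework's implementation; the squared-error loss, the single-feature restriction, the function names, and the example data are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the residuals (negative gradient of squared loss)."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Hypothetical one-feature training data (e.g., a feature value vs. a known TME level).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_boosted(xs, ys)
```

In practice a framework would handle many features, deeper trees, and regularization; the sketch only shows how each weak model corrects the residual error left by the ensemble so far.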
In some embodiments, for a given machine learning model, sub-acts 606a and 606b may be repeated multiple times (e.g., at least one hundred, at least one thousand, at least ten thousand, at least one hundred thousand, or at least one million times). In some embodiments, sub-acts 606a and 606b may be repeated for a set number of iterations or may be repeated until a threshold is surpassed (e.g., until loss decreases below a threshold value).
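The repetition of sub-acts 606a and 606b, with a set number of iterations or a loss threshold as the stopping condition, may be sketched as below. A simple linear model trained by gradient descent on a mean squared error loss is used purely as a stand-in to show the loop; it is not the model described in the source. The function name, learning rate, threshold, and data are assumptions.

```python
def train_linear_model(features, targets, learning_rate=0.05,
                       max_iterations=10_000, loss_threshold=1e-4):
    """Repeat predict (606a) and update (606b) until the loss is small enough."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    loss = float("inf")
    for _ in range(max_iterations):
        # Sub-act 606a: run the model on the training features.
        predictions = [bias + sum(w * x for w, x in zip(weights, row))
                       for row in features]
        errors = [p - t for p, t in zip(predictions, targets)]
        loss = sum(e * e for e in errors) / n_samples  # mean squared error
        if loss < loss_threshold:
            break
        # Sub-act 606b: gradient descent step on weights and bias.
        for j in range(n_features):
            grad_j = 2.0 * sum(e * row[j] for e, row in zip(errors, features)) / n_samples
            weights[j] -= learning_rate * grad_j
        bias -= learning_rate * 2.0 * sum(errors) / n_samples
    return weights, bias, loss


# Hypothetical data following target = 2 * feature + 1.
features = [[0.0], [1.0], [2.0], [3.0]]
targets = [1.0, 3.0, 5.0, 7.0]
weights, bias, final_loss = train_linear_model(features, targets)
```

The loop stops at whichever condition is met first, so `max_iterations` acts as a safeguard when the loss never falls below the threshold.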
At act 608, process 600 proceeds with determining whether there are additional machine learning models to be trained. For example, the plurality of machine learning models may include a second machine learning model for a second gene associated with tumor cells. Acts 602-606 may be repeated to train the second machine learning model to estimate the TME expression level of the second gene in the TME cells of the training sample(s). Additionally or alternatively, the plurality of machine learning models may include a third machine learning model for a third gene associated with tumor cells. Acts 602-606 may be repeated to train the third machine learning model to estimate the TME expression level of the third gene in the TME cells of the training sample(s).
If there are no remaining machine learning models to be trained, in some embodiments, the trained plurality of machine learning models are output. In some embodiments, outputting the trained plurality of machine learning models may comprise: storing one or more of the models in at least one non-transitory computer-readable storage medium (e.g., memory) for subsequent access, providing the model(s) to a recipient (e.g., transmitting data associated with the model(s) to a recipient using any suitable communication network or other means), displaying information associated with the model(s) to a user via a graphical user interface, and/or any other suitable manner of outputting the trained models, as aspects of the technology described herein are not limited in this respect. For example, the trained machine learning models may be stored in a data store, such as the machine learning model data store 454 described herein including at least with respect to
Training Data Generation
Data Collection, Analysis, and Preprocessing
According to some embodiments, the expression data may be obtained as described herein including at least with respect to
In some embodiments, selection of datasets may be based on both biological and bioinformatic parameters. For example, datasets with samples cultivated in conditions close to normal physiological conditions may be used. In some embodiments, datasets with abnormal stimulation were excluded, such as datasets of CD4+ T cells hyperstimulated with phorbol 12-myristate 13-acetate and ionomycin activation or macrophages co-cultured with an excessive number of bacterial cultures. In some embodiments, only those samples having at least 4 million coding read counts were used.
In some embodiments, quality control may be performed on the expression data prior to construction of the artificial mixes (e.g., to exclude strange or unreliable datasets). For example, if some samples of CD4+ T cells show no or very low expression of CD45, CD4 or CD3 genes, they may be excluded. The same may be done for other cell types, in some embodiments. For example, samples for some cell types may be excluded if they significantly express genes that are not typical for that type of cell (e.g., if in a sample of T cells, CD19, CD33, MS4A1, etc. were expressed in significant amounts, while in most other T cell samples these expressions were low). In some embodiments, samples of CD4+ T cells may be removed if they express significant amounts of CD8 genes. In some embodiments, several methods of expression analysis like t-SNE or PCA with different gene sets may be used to visualize the similarities and differences between datasets. If a particular cell type from one dataset fails to cluster with the same cell type in the other datasets (e.g., in a t-SNE, PCA, or other plot), then the one dataset may be further analyzed as part of quality control, and some or all of the data from that dataset may be excluded.
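As a nonlimiting illustration, the marker-based exclusions and the read count requirement described above may be expressed as a simple filter. The 4 million coding read minimum comes from the text; the TPM thresholds, the sample record layout, the function name, and the use of PTPRC and CD3E as the gene symbols standing for CD45 and a CD3 gene are assumptions of this sketch.

```python
MIN_CODING_READS = 4_000_000  # minimum stated in the text


def passes_qc(sample, expected_markers, unexpected_markers,
              min_expected_tpm=1.0, max_unexpected_tpm=10.0):
    """Return True if a purified-cell sample passes the basic checks.

    The TPM thresholds are hypothetical and would need tuning per cell type.
    """
    if sample["coding_reads"] < MIN_CODING_READS:
        return False
    tpm = sample["tpm"]
    # Exclude samples with no or very low expression of expected markers.
    if any(tpm.get(gene, 0.0) < min_expected_tpm for gene in expected_markers):
        return False
    # Exclude samples that significantly express atypical genes.
    if any(tpm.get(gene, 0.0) > max_unexpected_tpm for gene in unexpected_markers):
        return False
    return True


# Hypothetical marker lists for CD4+ T cell samples.
cd4_expected = ["PTPRC", "CD4", "CD3E"]
cd4_unexpected = ["CD19", "CD33", "MS4A1", "CD8A"]

sample = {"coding_reads": 5_000_000,
          "tpm": {"PTPRC": 120.0, "CD4": 40.0, "CD3E": 60.0, "CD19": 0.2}}
keep = passes_qc(sample, cd4_expected, cd4_unexpected)
```

A filter of this kind complements, rather than replaces, the clustering-based inspection (e.g., t-SNE or PCA plots), which can reveal dataset-level problems that per-sample thresholds miss.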
Mixes Construction
According to some embodiments, a variety of artificial mixes of expression data (e.g., representing simulated tumor tissue) may be constructed using samples prepared as described herein above. Artificial mixes may be generated using sample expressions in TPM (transcripts per million) units, such that the gene expressions for an overall sample are formed as a linear combination of the expressions of individual cells from that sample. In some embodiments, expression data from samples of various cell types may be mixed in predetermined proportions. As shown in
Referring now to branch 720, an exemplary process for generating simulated TME expression data is shown. In the illustrated example, samples of each cell type (e.g., samples of expression data, such as of genes GSE1, GSE2, GSE3, or GSE4, as shown) may be rebalanced by datasets (e.g., reducing the weight of datasets with a large number of samples) and subtypes (e.g., changing the proportions of subtypes of a sample). Techniques for rebalancing are described herein including with respect to the “Rebalancing by datasets” and “Rebalancing by subtypes” sections. For each cell type, multiple samples may then be randomly selected and averaged. Then, for some or all of the cell types being used, the rebalanced/averaged samples may be mixed together in particular proportions (e.g., so as to simulate a real tumor microenvironment).
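The averaging and mixing steps of branch 720 may be sketched as follows: several samples of each cell type are randomly selected and averaged, and the averaged profiles are combined as a linear combination in TPM units. Rebalancing by datasets and subtypes is omitted. The function names, the gene names, the expression values, and the mixing proportions are hypothetical.

```python
import random


def average_profiles(profiles):
    """Average per-gene TPM values across several samples of one cell type."""
    genes = profiles[0].keys()
    return {g: sum(p[g] for p in profiles) / len(profiles) for g in genes}


def mix_profiles(profile_by_cell_type, fractions):
    """Linear combination of cell type profiles in the given proportions."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("mixing fractions must sum to 1")
    genes = next(iter(profile_by_cell_type.values())).keys()
    return {g: sum(fractions[c] * profile_by_cell_type[c][g] for c in fractions)
            for g in genes}


def simulate_tme_mix(samples_by_cell_type, fractions, n_to_average=2, seed=0):
    """Randomly pick samples per cell type, average them, then mix."""
    rng = random.Random(seed)
    averaged = {}
    for cell_type, samples in samples_by_cell_type.items():
        chosen = rng.sample(samples, min(n_to_average, len(samples)))
        averaged[cell_type] = average_profiles(chosen)
    return mix_profiles(averaged, fractions)


# Hypothetical two-gene TPM profiles (each sums to one million).
samples_by_cell_type = {
    "t_cells": [{"G1": 600_000.0, "G2": 400_000.0},
                {"G1": 800_000.0, "G2": 200_000.0}],
    "fibroblasts": [{"G1": 100_000.0, "G2": 900_000.0}],
}
fractions = {"t_cells": 0.7, "fibroblasts": 0.3}
mix = simulate_tme_mix(samples_by_cell_type, fractions)
```

Because the mixing fractions sum to one and each input profile sums to one million, the mixed profile also sums to one million, which keeps the simulated sample in TPM units.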
Referring now to branch 710, an exemplary process for generating simulated tumor expression data is shown. In the illustrated example, random samples of cancer cells (e.g., NSCLC, ccRCC, Mel, HNCK, etc.) may be selected. Then, hyperexpression noise may be added to the resulting expression data to account for abnormal expression of genes by tumor cells. For example, tumor cells sometimes express genes which are ordinarily absent in the parental cell type. When this is the case for specific, semi-specific, or marker genes that are linked to immune or stromal cells within the TME, the overexpressed genes may interfere with the deconvolution techniques described herein. Regardless of whether hyperexpression noise is included, the result of branch 710 may be simulated tumor expression data.
As shown in
The inventors have recognized and appreciated that, when creating artificial mixes, it may be desirable to use different cells of the same type from different samples. Using a small number of samples for the mixes, or even just one sample for each cell type, would provide poor performance on real tumor samples (e.g., due to the variability of cell states and their expressions, as well as noise due to limited numbers of read counts for different expressions, alignment errors and other causes of technical noise). Therefore, when creating artificial mixtures, the inventors have recognized that it may be desirable to use as many available cell samples as possible.
Accordingly, for this example, a large number of RNA-seq samples (e.g., at least one hundred, at least five hundred, at least one thousand, at least two thousand, or at least five thousand samples) of various cell types were collected. In some embodiments, a number of datasets of tumor cells (e.g., pure cancer cells for various diagnoses, cancer cell lines or sorted from tumors) may also be collected. For each cell type, there may be a corresponding number of samples from different datasets.
In some embodiments, as described herein including with respect to
Averaging of Samples
In some embodiments, multiple samples for each cell type may be averaged in any suitable manner (e.g., to improve the quality of samples before adding artificial noise). For example, in some embodiments, averaging may be performed in groups of two, such that an averaged sample of 4 million reads may contain information on 8 million reads. In some embodiments, averaging across multiple samples may reduce the noise in the expression caused by technical factors during sequencing.
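As a minimal sketch of the averaging step described above, the following Python snippet randomly groups expression profiles of one cell type and averages each group gene by gene. The function name `average_samples`, the dictionary representation of a profile, and the handling of leftover samples are illustrative assumptions rather than part of the described embodiments.

```python
import random
from statistics import fmean


def average_samples(samples, group_size=2, rng=None):
    """Randomly group expression profiles and average each group gene by gene.

    `samples` is a list of dicts mapping gene name -> expression (e.g., TPM).
    Samples left over after forming full groups are dropped (an assumption).
    """
    rng = rng or random.Random(0)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    averaged = []
    for start in range(0, len(shuffled) - group_size + 1, group_size):
        group = shuffled[start:start + group_size]
        genes = group[0].keys()
        # Average the expression of every gene across the group.
        averaged.append({g: fmean(s[g] for s in group) for g in genes})
    return averaged
```

With a group size of two, each averaged profile draws on the reads of two original samples, which is the effect described in the paragraph above.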
Samples Rebalancing
Since different datasets and cell subtypes can vary significantly in the number of available cell samples, in some embodiments the number of samples may be rebalanced. As described herein below, in one example, the samples may be rebalanced by datasets, then by cell subtypes.
Rebalancing by Datasets
In some embodiments, the number of samples of sorted cells in datasets may range from one to several hundred (e.g., at least five, at least ten, at least 50, or at least 100 samples). Typically, each dataset may contain samples of one or two cell types, sorted and sequenced in the same way. Cell samples within the same dataset may also have specific conditions, such as a specific set of markers for sorting or a specific disease of patients from whom the cells were taken. Datasets with a large number of samples can lead to overtraining of models for such datasets. To reduce the weight of datasets with a large number of samples, samples of all datasets are resampled in order to rebalance by datasets.
For example, in some embodiments, the samples of each dataset are resampled with replacement to a new number of samples, Ndataset,new.
Where Nmax is the number of samples in the largest dataset (e.g., for the particular cell type) and Ndataset,old is the original number of samples in the dataset. The rebalance parameter in the equation is a value in the range [0, 1], where 0 means there is no change in the number of samples, and 1 means that every dataset will have the same number of samples. In some embodiments, the rebalancing parameter may be selected during training.
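The equation for Ndataset,new is not reproduced in this text. Purely as an illustration, the sketch below assumes a geometric interpolation between Ndataset,old and Nmax, which is one of several forms that satisfy the stated boundary behavior (a rebalance parameter of 0 leaves the sample count unchanged, and a parameter of 1 gives every dataset Nmax samples). The function names and the interpolation itself are assumptions and may differ from the formula actually used.

```python
import random


def rebalanced_size(n_old, n_max, rebalance):
    """Assumed form: N_new = N_old**(1 - r) * N_max**r, rounded to an integer."""
    if not 0.0 <= rebalance <= 1.0:
        raise ValueError("rebalance must be in [0, 1]")
    return max(1, round(n_old ** (1.0 - rebalance) * n_max ** rebalance))


def rebalance_by_datasets(datasets, rebalance, rng=None):
    """Resample each dataset with replacement to its rebalanced size.

    `datasets` maps dataset name -> list of samples for one cell type.
    """
    rng = rng or random.Random(0)
    n_max = max(len(samples) for samples in datasets.values())
    return {
        name: rng.choices(samples, k=rebalanced_size(len(samples), n_max, rebalance))
        for name, samples in datasets.items()
    }
```

Under this assumed form, intermediate parameter values shrink the gap between small and large datasets without fully equalizing them.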
Rebalancing by Cell Subtypes
For a number of cell types, in addition to samples of the type itself, there may also be samples of more specific subtypes. In some cases, the proportions of available subtype samples may not match the ratios specified when forming mixes with these subtypes. Therefore, when creating mixes for the cell type, samples of its subtypes may be rebalanced.
For example, in some embodiments, there may be significantly more CD4+ T cells (and T helpers with Tregs) samples available than CD8+ T cells. In this case, to form an average T cells sample, proportions of CD4+ and CD8+ T cells samples may be changed before the random selection of samples. For example, the proportions may be chosen similar to the ratios of the predicted average RNA fractions for the TCGA or PBMC samples for these cell types. In some embodiments, the predictions may be obtained using one or more linear models trained on mixes with equal cell proportions.
The subtype rebalancing algorithm may be as follows. To rebalance each subtype for a given type, resample with replacement a number of samples equal to:
Where Psubtype is a number reflecting the proportion of a given subtype (e.g., the proportion of this subtype among all subtypes for the given type, which may be represented as the number of samples for the subtype divided by the total number of samples for the type); msize is the maximum number of samples among all the subtypes for the given type, and min_P is the minimum value of Psubtype among all subtypes. According to some embodiments, the rebalancing operation may be performed recursively for all nested subtypes (e.g., subtypes which themselves have subtypes).
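The expression for the resampled count is not reproduced in this text. The following sketch assumes the count is msize × Psubtype / min_P, so that the subtype with the smallest proportion receives msize samples and the other subtypes scale in proportion. This form, the function names, and the omission of the recursive handling of nested subtypes are all assumptions made for illustration.

```python
import random


def subtype_sample_counts(proportions, available):
    """Assumed count per subtype: round(msize * P_subtype / min_P).

    `proportions` maps subtype -> P_subtype (must be positive);
    `available` maps subtype -> number of available samples.
    """
    msize = max(available.values())
    min_p = min(proportions.values())
    return {s: round(msize * p / min_p) for s, p in proportions.items()}


def rebalance_subtypes(samples_by_subtype, proportions, rng=None):
    """Resample each subtype with replacement to its assumed target count."""
    rng = rng or random.Random(0)
    available = {s: len(v) for s, v in samples_by_subtype.items()}
    counts = subtype_sample_counts(proportions, available)
    return {
        s: rng.choices(samples_by_subtype[s], k=counts[s])
        for s in samples_by_subtype
    }
```

Under this assumption, the resampled pools have sizes in the ratio given by Psubtype, so a subsequent uniform random selection draws subtypes in roughly the intended proportions.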
TME Cells Proportion Generation
According to some embodiments, the resulting samples of different cell types may be mixed with one another in random ratios in order to generate the simulated TME expression data. For example, a first set of artificial mixes may be generated using random proportions of each cell type:
Where Rcell is a random number distributed uniformly from 0 to 1 and Kcell is the coefficient for the particular cell type.
According to some embodiments, the coefficient Kcell in the above equations may be chosen so that the most likely ratios of cell mRNA are close to what is observed in TCGA or PBMC samples. These approximate ratios may be calculated from the TCGA or PBMC samples, using models trained without using such ratios. For example, a vector of numbers may be used, reflecting approximate proportions for a given type of tissue. Each number of the vector is multiplied by a random number from 0 to 1. The resulting coefficients are normalized by their sum and used in a linear combination. In some embodiments, Kcell may be selected from Table 5, which specifies, for each of multiple cell types, the most likely proportion of the cell type based on tumor tissue and blood (PBMC).
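The proportion generation described above (multiply each Kcell by a uniform random number Rcell, normalize by the sum, and use the result in a linear combination) can be sketched as follows. The function names and the dictionary data layout are illustrative assumptions, and any Kcell values passed in are hypothetical placeholders rather than values from Table 5.

```python
import random


def random_proportions(k_coeffs, rng=None):
    """Return normalized proportions K_cell * R_cell / sum(K * R).

    `k_coeffs` maps cell type -> K_cell; R_cell is drawn uniformly from [0, 1).
    """
    rng = rng or random.Random(0)
    raw = {cell: k * rng.random() for cell, k in k_coeffs.items()}
    total = sum(raw.values())
    return {cell: value / total for cell, value in raw.items()}


def mix_profiles(profiles, proportions):
    """Linear combination of per-cell-type expression profiles.

    `profiles` maps cell type -> {gene: expression}.
    """
    genes = next(iter(profiles.values())).keys()
    return {
        g: sum(proportions[c] * profiles[c][g] for c in profiles)
        for g in genes
    }
```

Because each Rcell is drawn independently, cell types with larger Kcell tend to receive larger proportions while any individual mix may still deviate substantially from the typical ratios.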
Noise Generation
As shown in
$$T_i^{\text{mix}} = T_i^{\text{mix}} + \mathrm{Noise}\left(T_i^{\text{mix}}\right)$$
In some embodiments, expression of each gene may contribute noise to the overall tissue expression. For example, the expression of a single gene (Tij) could be represented as a sum:
$$T_i^j = \mu_T \ldots$$
Where uT
In some embodiments, a relative standard deviation of Poisson technical noise (δP
$$\delta_i = \sqrt{\delta_P \ldots}$$
Technical variability may result from differences in sample and library preparation (non-Poisson noise) and random transcript selection on the sequencer track due to limited coverage (Poisson noise). Many cell types of the TME may typically occupy a small fraction in tumor samples. Therefore, the inventors have recognized and appreciated that it may be important to consider different levels of variability or noise for different genes, depending on the level of their expression. For example, in some embodiments, a TPM-based mathematical noise model is provided, which accounts for technical noise (both Poisson and non-Poisson). In some embodiments, this model of variability may be added to the artificial mixes generated to train the machine learning models, as described herein. In some embodiments, technical non-Poisson noise is assumed to be normally distributed. These may account for variability in the library preparation, alignment or variations in human handling of different samples. In contrast, Poisson noise is a type of technical noise which may be associated with the sequencing coverage or number of read counts and may not be normally distributed. The resulting dependence of technical noise on coverage and gene expression could be expressed by a formula:
Where i is an effective gene length,
In addition to technical noise, biological noise, which may be associated with different activated states of a cell, can contribute to the overall variance in an RNA-seq sample. In some embodiments, there may be no need to add biological noise to artificial mixes, as this noise may already be present through the use of RNA-seq data derived from cell subsets representing a variation of biological states.
In some embodiments, the analysis of noise contribution due to single gene expression, as described herein, may be applied to simulate technical and biological noise in artificial mixes. For example, noise may be added to total gene expression in two summands:
Where ξP, ξN ~ N(0, 1), β is the Poisson noise level coefficient, and γ is the coefficient of the uniform-level non-Poisson noise.
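The two-summand noise formula is not reproduced in this text. The sketch below illustrates one common way to realize such a model, under the assumption that the Poisson term scales with the square root of expression and the non-Poisson term scales linearly with expression. It omits the dependence on effective gene length and coverage discussed above, and the clipping at zero is also an assumption. The function name and signature are hypothetical.

```python
import math
import random


def add_technical_noise(expression, beta, gamma, rng=None):
    """Add assumed Poisson-like and non-Poisson noise to each gene.

    Assumed form: T + beta * sqrt(T) * xi_P + gamma * T * xi_N,
    with xi_P and xi_N drawn independently from N(0, 1).
    `expression` maps gene -> non-negative expression value.
    """
    rng = rng or random.Random(0)
    noisy = {}
    for gene, t in expression.items():
        xi_p = rng.gauss(0.0, 1.0)
        xi_n = rng.gauss(0.0, 1.0)
        noise = beta * math.sqrt(t) * xi_p + gamma * t * xi_n
        # Expression cannot be negative, so clip at zero (an assumption).
        noisy[gene] = max(0.0, t + noise)
    return noisy
```

In this assumed form, the relative contribution of the Poisson term grows as expression decreases, which is consistent with the observation above that weakly expressed genes warrant a different noise level.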
The noise model described herein may be used to add technical (both Poisson and non-Poisson) variation to artificial mixes. This results in artificial mixes which better mimic real tissues. Improved artificial mixes may subsequently be used to train the deconvolution algorithm (e.g., as described herein including with respect to
Additional examples and techniques for generating training data including simulated expression data are described in the “Cellular Deconvolution” section and in U.S. Patent Publication No. 2021-0287759, entitled “SYSTEMS AND METHODS FOR DECONVOLUTION OF EXPRESSION DATA”, the entire contents of which are incorporated herein by reference.
Cellular Deconvolution
At act 802, the process 800 begins with obtaining expression data for a biological sample from a subject. In some embodiments, obtaining expression data may include obtaining expression data from a biological sample that has been previously obtained from a subject using any suitable techniques. In some embodiments, obtaining the expression data may include obtaining expression data that has been previously obtained from a biological sample (e.g., obtaining the expression data by accessing a database). In some embodiments, the expression data is RNA expression data. Examples of RNA expression data are provided herein. In some embodiments, the subject may have, be suspected of having, or be at risk of having cancer. The biological sample may comprise a biopsy (e.g., of a tumor or other diseased tissue of the subject), any of the embodiments described herein including with respect to the “Biological Samples” section, or any other suitable type of biological sample. In some embodiments, the origin or preparation of the expression data may include any of the embodiments described with respect to the “Expression Data” and “Obtaining Expression Data” sections. For example, the expression data may be RNA expression data extracted using any suitable techniques. As another example, the expression data obtained at act 802 may comprise RNA expression data measured in TPM.
In some embodiments, the expression data may be stored on at least one storage medium and accessed as part of act 802. For example, the expression data may be stored in one or more files or in a database, then read. In some embodiments, the at least one storage medium storing the RNA expression data may be local to the computing device (e.g., stored on the same at least one non-transitory storage medium), or may be external to the computing device (e.g., stored in a remote database or a cloud storage environment). The expression data may be stored on a single storage medium or may be distributed across multiple storage mediums.
In some embodiments, the expression data of act 802 may include first expression data associated with a first set of genes associated with a first cell type (e.g., a cell type of the cell types and/or subtypes being analyzed in the biological sample). In some embodiments, the first set of genes may comprise genes that are specific and/or semi-specific to the first cell type. For example, for the endothelium cell type, the set of genes may comprise: ANGPT2, APLN, CDH5, CLEC14A, ECSCR, EMCN, ENG, ESAM, ESM1, FLT1, HHIP, KDR, MMRN1, MMRN2, NOS3, PECAM1, PTPRB, RASIP1, ROBO4, SELE, TEK, TIE1, and/or VWF. In some embodiments, the first set of genes may be the same as a set of genes, or a subset of a set of genes, used as part of training a corresponding non-linear regression model for the cell type.
At act 804, the process 800 proceeds with determining first RNA percentages for at least the first cell type. As shown, determining first RNA percentages for the first cell type may comprise processing first expression data associated with a first set of genes for the first cell type with a first non-linear regression model (e.g., of the one or more non-linear regression models) to determine the first RNA percentages for the first cell type. For example, the first expression data may be provided as input to the first non-linear regression model. In some embodiments, other information may be provided as part of the input to the non-linear regression model. For example, a median of the expression data may be included as part of the input to the non-linear regression model. In some embodiments, any other suitable information may additionally or alternatively be provided as part of the input (e.g., an average of the expression data, a median or average of a subset of the expression data, or any other suitable statistics derived from or otherwise relating to the expression data).
In some embodiments, parts of act 804 may be repeated and/or performed in parallel for each cell type and/or subtype being analyzed. For example, a subset of the expression data may be provided as input to each non-linear regression model for each respective cell type and/or subtype.
In some embodiments, the output of the non-linear regression model may comprise information representing estimated percentages of RNA from the first cell type in the sample.
In some embodiments, process 800 then proceeds to act 806 for outputting the first RNA percentages. Regardless of the architecture or input(s) to the non-linear regression models, including the non-linear regression model for the first cell type, the output(s) of the one or more non-linear regression models may be combined, stored, or otherwise post-processed as part of process 800. For example, the RNA percentages for each cell type may be stored locally on the computing device used to perform process 800 (e.g., on the non-transitory storage medium). In some embodiments, the RNA percentages may be stored in one or more external storage mediums (e.g., such as a remote database or cloud storage environment).
In some embodiments, the example implementation 820 begins at act 812, where expression data is obtained for a biological sample from a subject. Obtaining expression data for a biological sample from a subject is described herein above including with respect to act 802 of
In some embodiments, act 812 may include obtaining first expression data and second expression data. The first expression data may be associated with a first set of genes that is associated with a first cell type, while the second expression data may be associated with a second set of genes that is associated with a second cell type. For example, the first expression data may be associated with a first set of genes that is associated with B cells, while the second expression data may be associated with a second set of genes that is associated with T cells. Additionally or alternatively, the first expression data may be associated with a first set of genes associated with a first cell subtype, while the second expression data may be associated with a second set of genes associated with a second cell subtype. For example, the first expression data may be associated with a first set of genes associated with CD4+ cells, while the second expression data may be associated with a second set of genes associated with CD8+ cells.
In some embodiments, the example process 820 proceeds to act 814, where the expression data is pre-processed. In some embodiments, the pre-processing may make the expression data suitable to be processed using the one or more non-linear regression models. For example, the expression data may be sorted, combined, organized into batches, filtered, or pre-processed with any other suitable techniques.
After the expression data is pre-processed, example process 820 proceeds to act 816, where a plurality of RNA percentages may be determined for a plurality of cell types using the expression data and one or more non-linear regression models (e.g., at least five, at least ten, or at least fifteen models).
In some embodiments, a separate non-linear regression model may be used to estimate RNA percentages for each cell type and/or subtype. For example, act 816 may include act 816a and act 816b, each of which includes using a separate non-linear regression model trained for determining RNA percentages for the first and second cell types and/or subtypes, respectively. Act 816a includes determining first RNA percentages for the first cell type using the first expression data and a first non-linear regression model. Act 816b includes determining second RNA percentages for the second cell type using the second expression data and a second non-linear regression model. In some embodiments, act 816 may include only one of acts 816a and 816b. In some embodiments, act 816 may include using one or more additional non-linear regression models for determining RNA percentages for one or more other cell types (e.g., a third cell type or subtype). An example implementation of act 816a is described herein including with respect to
In some embodiments, the RNA percentages obtained at act 816 are output at act 818 of process 820.
In some embodiments, the first expression data may include first expression data associated with a first set of genes associated with the first cell type, as well as second expression data associated with a second set of genes associated with the first cell type.
In some embodiments, the example implementation begins at act 832, for predicting first values for the estimated percentages of RNA from the first cell type, using a first sub-model. In some embodiments, the first expression data associated with the first set of genes and/or any other input information may be provided as input to the first sub-model of the non-linear regression model, and the output may be one or more predicted percentages of RNA from the first cell type.
In some embodiments, after predicting the first values, the example implementation proceeds to act 834, for predicting second values for the estimated percentage of RNA from the first cell type, using a second sub-model. In some embodiments, the second expression data associated with the second set of genes may be provided as input to the second sub-model of the non-linear regression model in addition to the prediction from the first sub-model and/or any other input information provided at the first sub-model. Additionally or alternatively, the first expression data associated with the first set of genes may be provided as input to the second sub-model. According to some embodiments, predictions from multiple non-linear regression models (e.g., the output of the first sub-model of each non-linear regression model for each cell type) may be provided as input to the second sub-model of the non-linear regression model for the first cell type. Regardless of the input to the second sub-model, the output of the second sub-model of the non-linear regression model may be an estimated percentage of RNA from the first cell type in the sample. The output of the second sub-model may comprise the output of the non-linear regression model for the first cell type, in some embodiments.
In some embodiments, the non-linear regression model may comprise more than two sub-models. For example, the second sub-model may be repeated any number of times, with the predictions from one or more of the prior sub-models being included as input each time.
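The chaining of sub-models described above can be illustrated with the following sketch, in which each sub-model is represented by an arbitrary callable that maps a feature list to a predicted RNA percentage. The function name, the choice to append the median of the first expression data as an extra feature, and the exact feature layout passed to later sub-models are assumptions for illustration. In practice each callable would wrap a trained non-linear regression model.

```python
from statistics import median


def predict_rna_percentage(first_expr, second_expr, sub_models):
    """Run a cascade of sub-models for one cell type.

    `first_expr` and `second_expr` are lists of expression values for the
    first and second gene sets; `sub_models` is a list of callables.
    """
    # First sub-model: first gene set plus a summary statistic (the median).
    features = list(first_expr) + [median(first_expr)]
    prediction = sub_models[0](features)
    # Later sub-models: second gene set plus the previous prediction.
    for model in sub_models[1:]:
        features = list(second_expr) + [prediction]
        prediction = model(features)
    return prediction
```

For example, with hypothetical stand-in callables `first = lambda f: sum(f) / len(f)` and `second = lambda f: 0.5 * f[-1]`, calling `predict_rna_percentage([2.0, 4.0, 6.0], [1.0], [first, second])` passes the first-stage prediction into the second stage. Adding more callables to the list repeats the second stage, as described above for models with more than two sub-models.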
Example Experiments
Experiments were undertaken to test the performance of the machine learning techniques described herein.
Preparation of Datasets
Several types of datasets were used for model development and evaluation.
First, artificial transcriptomes created from different solid tumor cell lines with the addition of various TME cellular populations (B cells, plasma B cells, CD4+ T cells, CD8+ T cells, macrophages, fibroblasts, endothelium, neutrophils, NK cells, monocytes) were used. Cell proportions were randomly assigned to each TME cell type so that their sum varied from 10% to 60%, while tumor fraction constituted 40-90% of the total sample. Overall, 900,000 artificial transcriptomes were generated for training and 100 samples for validation using 7,114 samples of purified TME cell types and 3,143 samples of cancer cell lines.
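As an illustration of the composition constraints stated above (tumor fraction of 40-90%, with the TME cell types sharing the remaining 10-60%), the following sketch draws one random sample composition. The uniform sampling of the tumor fraction, the way the TME share is split, and the function name are assumptions; the distributions actually used in the experiments are not specified here.

```python
import random


def sample_composition(tme_cell_types, rng=None):
    """Draw a tumor fraction in [0.40, 0.90] and split the rest among TME types."""
    rng = rng or random.Random(0)
    tumor_fraction = rng.uniform(0.40, 0.90)
    # Random positive weights, normalized so the TME types sum to 1 - tumor.
    weights = [rng.random() for _ in tme_cell_types]
    total = sum(weights)
    tme_total = 1.0 - tumor_fraction
    composition = {
        cell: tme_total * w / total for cell, w in zip(tme_cell_types, weights)
    }
    composition["tumor"] = tumor_fraction
    return composition
```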
Single-cell data for different cancer types was used to test the models. For melanoma, glioblastoma, and head and neck cancer, scRNAseq-based artificial mixtures were generated from patient-specific single-cell data following the same strategy described above. Additionally, for lung cancer, a public dataset of patient-specific single-cell data was used without an additional step of artificial transcriptome generation, alongside single-cell data for non-small-cell lung carcinoma.
In vitro experiments were also conducted for additional evaluation of the models, in which different proportions of RNA extracted from PBMCs were mixed with RNA extracted from three cancer cell lines: COLO829 (cutaneous melanoma), MCF-7 (invasive ductal carcinoma), and K562 (chronic myeloid leukemia). The fraction of tumor cell RNA in these in vitro mixtures constituted 25%-95%. After that, gene expression was quantified, and model predictions were compared with the pure cancer cell line expressions.
Model Validation: Validation on Artificial Transcriptomes
First, the models were validated on the dataset of artificial transcriptomes, in which the percentage of tumor cells varied from 40% to 90%.
Next, the machine learning techniques were tested on single-cell data from different cancer types.
Model Testing on In Vitro Data
Model evaluation on in vitro data showed that the machine learning techniques described herein improved the concordance correlation coefficient and mean absolute error (MAE) for at least 74 tumor biomarkers (Table 6). Overall, as shown in
For example, as shown in
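The two evaluation metrics named above, the concordance correlation coefficient and the mean absolute error, can be computed as in the following sketch. It uses Lin's standard definition of the concordance correlation coefficient with population (biased) variances. The function names are illustrative, and the snippet is not the evaluation code used in the experiments.

```python
from statistics import fmean


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between paired values."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def concordance_correlation(y_true, y_pred):
    """Lin's concordance correlation coefficient.

    CCC = 2 * cov / (var_true + var_pred + (mean_true - mean_pred)**2).
    Undefined (division by zero) when both inputs are constant and equal.
    """
    mean_t, mean_p = fmean(y_true), fmean(y_pred)
    var_t = fmean((t - mean_t) ** 2 for t in y_true)
    var_p = fmean((p - mean_p) ** 2 for p in y_pred)
    cov = fmean((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
```

Unlike the Pearson correlation, the concordance correlation coefficient penalizes a systematic offset between predicted and reference expression, which makes it suited to comparing predicted tumor expression with the expression of the pure cancer cell line.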
Example Model Parameters
Each machine learning model trained and validated in the above-described experiments comprises a gradient boosted machine learning model trained using the LightGBM gradient boosting framework.
Table 7 lists example parameters for such a machine learning model:
Tumor-specific gene expression analysis plays a decisive role in a wide range of biomedical issues, including, for example, adjustment of personalized genetic-based treatment strategies, determination of prognosis, assessing clinical trial endpoints, identifying new biomarkers, and correcting therapy indications for previously-known biomarkers.
In some embodiments, the effectiveness of a targeted anti-tumor therapy (e.g., monoclonal antibody therapy and CAR-T) depends on the relative abundance of the therapeutic target in tumor cells. As an example, HERCEPTIN® (trastuzumab) is approved by the FDA to treat certain breast and stomach cancers but only in patients whose tumors overexpress HER2 (the product of the ERBB2 gene), thereby reaffirming the need for accurate determination of intra-tumoral ERBB2 expression. Correct tumor expression determination by the machine learning techniques described herein may allow for avoiding TME-caused false-positive results and the resulting false-positive indications for HERCEPTIN® (trastuzumab).
An additional example that demonstrates the range of such false-positive errors is shown for PIK3CD, a target for idelalisib, an FDA-approved selective PI3K inhibitor.
Despite the moderate initial expression values, the expression of PIK3CD after the application of the machine learning techniques described herein is barely detectable, leading to a lack of indications for the use of PIK3CD-specific therapeutics. In the same way, the techniques described herein can be used to correct therapeutic recommendations for the medications targeting any of the genes from Table 6.
An even more pronounced effect of using the developed algorithm can be observed in the example for MMP2 (matrix metalloproteinase-2), an enzyme that in humans is encoded by the MMP2 gene.
The high level of MMP2 was shown to be associated with both improved disease-free survival and overall survival in breast cancer patients receiving bevacizumab- and trastuzumab-based neoadjuvant chemotherapy. The dramatic change of the gene expression level would entail revising the prognosis for the sample/patient. In the same way, the machine learning techniques described herein can be used to correct prognostic assessments for any of the prognostic/predictive biomarkers listed in Table 6.
Biological Samples
Any of the methods, systems, or other claimed elements may use or be used to analyze a biological sample from a subject. In some embodiments, a biological sample is obtained from a subject having, suspected of having, or at risk of having cancer. The biological sample may be any type of biological sample including, for example, a biological sample of a bodily fluid (e.g., blood, urine or cerebrospinal fluid), one or more cells (e.g., from a scraping or brushing such as a cheek swab or tracheal brushing), a piece of tissue (e.g., cheek tissue, muscle tissue, lung tissue, heart tissue, brain tissue, or skin tissue), or some or all of an organ (e.g., brain, lung, liver, bladder, kidney, pancreas, intestines, or muscle), or other types of biological samples (e.g., feces or hair).
In some embodiments, the biological sample is a sample of a tumor from a subject. In some embodiments, the biological sample is a sample of blood from a subject. In some embodiments, the biological sample is a sample of tissue from a subject.
A sample of a tumor, in some embodiments, refers to a sample comprising cells from a tumor. In some embodiments, the sample of the tumor comprises cells from a benign tumor, e.g., non-cancerous cells. In some embodiments, the sample of the tumor comprises cells from a premalignant tumor, e.g., precancerous cells. In some embodiments, the sample of the tumor comprises cells from a malignant tumor, e.g., cancerous cells. In some embodiments, the sample of tumor can include a mixture of cancerous, non-cancerous, and/or precancerous cells.
Examples of tumors include, but are not limited to, adenomas, fibromas, hemangiomas, lipomas, cervical dysplasia, metaplasia of the lung, leukoplakia, carcinoma, sarcoma, germ cell tumors, melanomas, mesotheliomas, gliomas, and blastoma.
A sample of blood, in some embodiments, refers to a sample comprising cells, e.g., cells from a blood sample. In some embodiments, the sample of blood comprises non-cancerous cells. In some embodiments, the sample of blood comprises precancerous cells. In some embodiments, the sample of blood comprises cancerous cells. In some embodiments, the sample of blood comprises blood cells. In some embodiments, the sample of blood comprises red blood cells. In some embodiments, the sample of blood comprises white blood cells. In some embodiments, the sample of blood comprises platelets. Examples of cancers involving blood cells include, but are not limited to, leukemia, lymphoma, and myeloma. In some embodiments, a sample of blood is collected to obtain the cell-free nucleic acid (e.g., cell-free DNA) in the blood.
A sample of blood may be a sample of whole blood or a sample of fractionated blood. In some embodiments, the sample of blood comprises whole blood. In some embodiments, the sample of blood comprises fractionated blood. In some embodiments, the sample of blood comprises buffy coat. In some embodiments, the sample of blood comprises serum. In some embodiments, the sample of blood comprises plasma. In some embodiments, the sample of blood comprises a blood clot.
A sample of tissue, in some embodiments, refers to a sample comprising cells from a tissue. In some embodiments, the sample of the tissue comprises non-cancerous cells from a tissue. In some embodiments, the sample of the tissue comprises precancerous cells from a tissue. In some embodiments, the sample of the tissue comprises cancerous tissue. In some embodiments, the sample can comprise cancerous, precancerous, or non-cancerous cells.
Methods of the present disclosure encompass a variety of tissue including organ tissue or non-organ tissue, including but not limited to, muscle tissue, brain tissue, lung tissue, liver tissue, epithelial tissue, connective tissue, and nervous tissue. In some embodiments, the tissue may be normal tissue, or it may be diseased tissue or it may be tissue suspected of being diseased. In some embodiments, the tissue may be sectioned tissue or whole intact tissue. In some embodiments, the tissue may be animal tissue or human tissue. Animal tissue includes, but is not limited to, tissues obtained from rodents (e.g., rats or mice), primates (e.g., monkeys), dogs, cats, and farm animals.
The biological sample may be from any source in the subject's body including, but not limited to, any fluid [such as blood (e.g., whole blood, blood serum, or blood plasma), saliva, tears, synovial fluid, cerebrospinal fluid, pleural fluid, pericardial fluid, ascitic fluid, and/or urine], hair, skin (including portions of the epidermis, dermis, and/or hypodermis), oropharynx, laryngopharynx, esophagus, stomach, bronchus, salivary gland, tongue, oral cavity, nasal cavity, vaginal cavity, anal cavity, bone, bone marrow, brain, thymus, spleen, small intestine, appendix, colon, rectum, anus, liver, biliary tract, pancreas, kidney, ureter, bladder, urethra, uterus, vagina, vulva, ovary, cervix, scrotum, penis, prostate, testicle, seminal vesicles, and/or any type of tissue (e.g., muscle tissue, epithelial tissue, connective tissue, or nervous tissue).
Any of the biological samples described herein may be obtained from the subject using any known technique. See, for example, the following publications on collecting, processing, and storing biological samples, each of which is incorporated herein in its entirety: Biospecimens and biorepositories: from afterthought to science by Vaught et al. (Cancer Epidemiol Biomarkers Prev. 2012 February; 21(2):253-5), and Biological sample collection, processing, storage and information management by Vaught and Henderson (IARC Sci Publ. 2011; (163):23-42).
In some embodiments, the biological sample may be obtained from a surgical procedure (e.g., laparoscopic surgery, microscopically controlled surgery, or endoscopy), bone marrow biopsy, punch biopsy, endoscopic biopsy, or needle biopsy (e.g., a fine-needle aspiration, core needle biopsy, vacuum-assisted biopsy, or image-guided biopsy).
In some embodiments, one or more than one cell (a cell biological sample) may be obtained from a subject using a scrape or brush method. The cell biological sample may be obtained from any area in or from the body of a subject including, for example, from one or more of the following areas: the cervix, esophagus, stomach, bronchus, or oral cavity. In some embodiments, one or more than one piece of tissue (e.g., a tissue biopsy) from a subject may be used. In certain embodiments, the tissue biopsy may comprise one or more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10) biological samples from one or more tumors or tissues known or suspected of having cancerous cells.
Any of the biological samples from a subject described herein may be stored using any method that preserves stability of the biological sample. In some embodiments, preserving the stability of the biological sample means inhibiting components (e.g., DNA, RNA, protein, or tissue structure or morphology) of the biological sample from degrading until they are measured so that when measured, the measurements represent the state of the sample at the time of obtaining it from the subject. In some embodiments, a biological sample is stored in a composition that is able to penetrate the sample and protect components (e.g., DNA, RNA, protein, or tissue structure or morphology) of the biological sample from degrading. As used herein, degradation is the transformation of a component from one form to another such that the first form is no longer detected at the same level as before degradation.
In some embodiments, a biological sample (e.g., tissue sample) is fixed. As used herein, a “fixed” sample relates to a sample that has been treated with one or more agents or processes in order to prevent or reduce decay or degradation, such as autolysis or putrefaction, of the sample. Examples of fixative processes include but are not limited to heat fixation, immersion fixation, and perfusion. In some embodiments a fixed sample is treated with one or more fixative agents. Examples of fixative agents include but are not limited to cross-linking agents (e.g., aldehydes, such as formaldehyde, formalin, glutaraldehyde, etc.), precipitating agents (e.g., alcohols, such as ethanol, methanol, acetone, xylene, etc.), mercurials (e.g., B-5, Zenker's fixative, etc.), picrates, and Hepes-glutamic acid buffer-mediated organic solvent protection effect (HOPE) fixative. In some embodiments, a biological sample (e.g., tissue sample) is treated with a cross-linking agent. In some embodiments, the cross-linking agent comprises formalin. In some embodiments, a formalin-fixed biological sample is embedded in a solid substrate, for example paraffin wax. In some embodiments, the biological sample is a formalin-fixed paraffin-embedded (FFPE) sample. Methods of preparing FFPE samples are known, for example as described by Li et al. JCO Precis Oncol. 2018; 2: PO.17.00091.
In some embodiments, the biological sample is stored using cryopreservation. Non-limiting examples of cryopreservation include step-down freezing, blast freezing, direct plunge freezing, snap freezing, slow freezing using a programmable freezer, and vitrification. In some embodiments, the biological sample is stored using lyophilization. In some embodiments, a biological sample is placed into a container that already contains a preservant (e.g., RNALater to preserve RNA) and then frozen (e.g., by snap-freezing), after the collection of the biological sample from the subject. In some embodiments, such storage in a frozen state is done immediately after collection of the biological sample. In some embodiments, a biological sample may be kept at either room temperature or 4° C. for some time (e.g., up to an hour, up to 8 h, or up to 1 day, or a few days) in a preservant or in a buffer without a preservant, before being frozen.
Non-limiting examples of preservants include formalin solutions, formaldehyde solutions, RNALater or other equivalent solutions, TriZol or other equivalent solutions, DNA/RNA Shield or equivalent solutions, EDTA (e.g., Buffer AE (10 mM Tris-Cl; 0.5 mM EDTA, pH 9.0)) and other anticoagulants, and Acid Citrate Dextrose (e.g., for blood specimens).
In some embodiments, special containers may be used for collecting and/or storing a biological sample. For example, a vacutainer may be used to store blood. In some embodiments, a vacutainer may comprise a preservant (e.g., a coagulant, or an anticoagulant). In some embodiments, a container in which a biological sample is preserved may be contained in a secondary container, for the purpose of better preservation, or for the purpose of avoiding contamination.
Any of the biological samples from a subject described herein may be stored under any condition that preserves stability of the biological sample. In some embodiments, the biological sample is stored at a temperature that preserves stability of the biological sample. In some embodiments, the sample is stored at room temperature (e.g., 25° C.). In some embodiments, the sample is stored under refrigeration (e.g., 4° C.). In some embodiments, the sample is stored under freezing conditions (e.g., −20° C.). In some embodiments, the sample is stored under ultralow temperature conditions (e.g., −50° C. to −80° C.). In some embodiments, the sample is stored under liquid nitrogen (e.g., −170° C.). In some embodiments, a biological sample is stored at −60° C. to −80° C. (e.g., −70° C.) for up to 5 years (e.g., up to 1 month, up to 2 months, up to 3 months, up to 4 months, up to 5 months, up to 6 months, up to 7 months, up to 8 months, up to 9 months, up to 10 months, up to 11 months, up to 1 year, up to 2 years, up to 3 years, up to 4 years, or up to 5 years). In some embodiments, a biological sample is stored as described by any of the methods described herein for up to 20 years (e.g., up to 5 years, up to 10 years, up to 15 years, or up to 20 years).
Methods of the present disclosure encompass obtaining one or more biological samples from a subject for analysis. In some embodiments, one biological sample is collected from a subject for analysis. In some embodiments, more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more) biological samples are collected from a subject for analysis. In some embodiments, one biological sample from a subject will be analyzed. In some embodiments, more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more) biological samples may be analyzed. If more than one biological sample from a subject is analyzed, the biological samples may be procured at the same time (e.g., more than one biological sample may be taken in the same procedure), or the biological samples may be taken at different times (e.g., during a different procedure including a procedure 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 days; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 weeks; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 months; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 years; or 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 decades after a first procedure).
A second or subsequent biological sample may be taken or obtained from the same region (e.g., from the same tumor or area of tissue) or a different region (including, e.g., a different tumor). A second or subsequent biological sample may be taken or obtained from the subject after one or more treatments and may be taken from the same region or a different region. As a non-limiting example, the second or subsequent biological sample may be useful in determining whether the cancer in each biological sample has different characteristics (e.g., in the case of biological samples taken from two physically separate tumors in a patient) or whether the cancer has responded to one or more treatments (e.g., in the case of two or more biological samples from the same tumor or different tumors prior to and subsequent to a treatment). In some embodiments, each of the at least one biological sample is a bodily fluid sample, a cell sample, or a tissue biopsy sample.
In some embodiments, one or more biological specimens are combined (e.g., placed in the same container for preservation) before further processing. For example, a first sample of a first tumor obtained from a subject may be combined with a second sample of a second tumor from the subject, wherein the first and second tumors may or may not be the same tumor. In some embodiments, a first tumor and a second tumor are similar but not the same (e.g., two tumors in the brain of a subject). In some embodiments, a first biological sample and a second biological sample from a subject are samples of different types of tumors (e.g., a tumor in muscle tissue and a tumor in brain tissue).
In some embodiments, a sample from which RNA and/or DNA is extracted (e.g., a sample of tumor, or a blood sample) is sufficiently large such that at least 2 μg (e.g., at least 2 μg, at least 2.5 μg, at least 3 μg, at least 3.5 μg or more) of RNA can be extracted from it. In some embodiments, the sample from which RNA and/or DNA is extracted can be peripheral blood mononuclear cells (PBMCs). In some embodiments, the sample from which RNA and/or DNA is extracted can be any type of cell suspension. In some embodiments, a sample from which RNA and/or DNA is extracted (e.g., a sample of tumor, or a blood sample) is sufficiently large such that at least 1.8 μg RNA can be extracted from it. In some embodiments, at least 50 mg (e.g., at least 1 mg, at least 2 mg, at least 3 mg, at least 4 mg, at least 5 mg, at least 10 mg, at least 12 mg, at least 15 mg, at least 18 mg, at least 20 mg, at least 22 mg, at least 25 mg, at least 30 mg, at least 35 mg, at least 40 mg, at least 45 mg, or at least 50 mg) of tissue sample is collected from which RNA and/or DNA is extracted. In some embodiments, at least 20 mg of tissue sample is collected from which RNA and/or DNA is extracted. In some embodiments, at least 30 mg of tissue sample is collected. In some embodiments, at least 10-50 mg (e.g., 10-50 mg, 10-15 mg, 10-30 mg, 10-40 mg, 20-30 mg, 20-40 mg, 20-50 mg, or 30-50 mg) of tissue sample is collected from which RNA and/or DNA is extracted. In some embodiments, at least 20-30 mg of tissue sample is collected from which RNA and/or DNA is extracted.
In some embodiments, a sample from which RNA and/or DNA is extracted (e.g., a sample of tumor, or a blood sample) is sufficiently large such that at least 0.2 μg (e.g., at least 200 ng, at least 300 ng, at least 400 ng, at least 500 ng, at least 600 ng, at least 700 ng, at least 800 ng, at least 900 ng, at least 1 μg, at least 1.1 μg, at least 1.2 μg, at least 1.3 μg, at least 1.4 μg, at least 1.5 μg, at least 1.6 μg, at least 1.7 μg, at least 1.8 μg, at least 1.9 μg, or at least 2 μg) of RNA can be extracted from it. In some embodiments, a sample from which RNA and/or DNA is extracted (e.g., a sample of tumor, or a blood sample) is sufficiently large such that at least 0.1 μg (e.g., at least 100 ng, at least 200 ng, at least 300 ng, at least 400 ng, at least 500 ng, at least 600 ng, at least 700 ng, at least 800 ng, at least 900 ng, at least 1 μg, at least 1.1 μg, at least 1.2 μg, at least 1.3 μg, at least 1.4 μg, at least 1.5 μg, at least 1.6 μg, at least 1.7 μg, at least 1.8 μg, at least 1.9 μg, or at least 2 μg) of RNA can be extracted from it.
Subjects
Aspects of this disclosure relate to a biological sample that has been obtained from a subject. In some embodiments, a subject is a mammal (e.g., a human, a mouse, a cat, a dog, a horse, a hamster, a cow, a pig, or other domesticated animal). In some embodiments, a subject is a human. In some embodiments, a subject is an adult human (e.g., of 18 years of age or older). In some embodiments, a subject is a child (e.g., less than 18 years of age). In some embodiments, a human subject is one who has or has been diagnosed with at least one form of cancer.
In some embodiments, a cancer from which a subject suffers is a carcinoma, a sarcoma, a myeloma, a leukemia, a lymphoma, a melanoma, a mesothelioma, a glioma, or a mixed type of cancer that comprises more than one of a carcinoma, a sarcoma, a myeloma, a leukemia, and a lymphoma. Carcinoma refers to a malignant neoplasm of epithelial origin or cancer of the internal or external lining of the body. Sarcoma refers to cancer that originates in supportive and connective tissues such as bones, tendons, cartilage, muscle, and fat. Myeloma is cancer that originates in the plasma cells of bone marrow. Leukemias (“liquid cancers” or “blood cancers”) are cancers of the bone marrow (the site of blood cell production). Lymphomas develop in the glands or nodes of the lymphatic system, a network of vessels, nodes, and organs (specifically the spleen, tonsils, and thymus) that purify bodily fluids and produce infection-fighting white blood cells, or lymphocytes. Melanoma is a type of skin cancer that originates in the melanocytes of the skin. Mesotheliomas are cancers that arise from the mesothelium, which forms the lining of organs and cavities, such as, for example, the lungs and the abdomen. Glioma develops in the brain, and specifically in the glial cells, which provide physical and metabolic support to neurons. Non-limiting examples of a mixed type of cancer include adenosquamous carcinoma, mixed mesodermal tumor, carcinosarcoma, and teratocarcinoma. In some embodiments, a subject has a tumor. A tumor may be benign or malignant.
In some embodiments, a cancer is any one of the following: skin cancer, lung cancer, breast cancer, prostate cancer, colon cancer, pancreatic cancer, rectal cancer, cervical cancer, and cancer of the uterus. In some embodiments, a subject is at risk for developing cancer, e.g., because the subject has one or more genetic risk factors, or has been exposed to or is being exposed to one or more carcinogens (e.g., cigarette smoke, or chewing tobacco).
Expression Data
Expression data (e.g., indicating expression levels) for a plurality of genes may be used for any of the methods or compositions described herein. The number of genes which may be examined may be up to and inclusive of all the genes of the subject. In some embodiments, expression levels may be examined for all of the genes of a subject. As a non-limiting example, four or more, five or more, six or more, seven or more, eight or more, nine or more, ten or more, eleven or more, twelve or more, 13 or more, 14 or more, 15 or more, 16 or more, 17 or more, 18 or more, 19 or more, 20 or more, 21 or more, 22 or more, 23 or more, 24 or more, 25 or more, 26 or more, 27 or more, 28 or more, 29 or more, 30 or more, 35 or more, 40 or more, 50 or more, 60 or more, 70 or more, 80 or more, 90 or more, 100 or more, 125 or more, 150 or more, 175 or more, 200 or more, 225 or more, 250 or more, 275 or more, or 300 or more genes may be used for any evaluation described herein. As another set of non-limiting examples, the expression data may include expression data for at least 5, at least 10, at least 20, at least 25, at least 35, at least 50, at least 75, at least 100, at least 125, at least 150 or more genes selected from the genes listed in Table 1. Additionally or alternatively, the expression data may include expression data for at least 5, at least 10, at least 20, at least 25, at least 35, at least 50, at least 75, at least 100, at least 125, at least 150, at least 200, at least 250, at least 300, at least 350, at least 400 or more genes selected from the genes listed in Table 2.
Any method may be used on a sample from a subject in order to acquire expression data (e.g., indicating expression levels) for the plurality of genes. As a set of non-limiting examples, the expression data may be RNA expression data, DNA expression data, or protein expression data.
DNA expression data, in some embodiments, refers to a level of DNA (e.g., copy number of a chromosome, gene, or other genomic region) in a sample from a subject. The level of DNA in a sample from a subject having cancer may be elevated compared to the level of DNA in a sample from a subject not having cancer, e.g., a gene duplication in a cancer patient's sample. The level of DNA in a sample from a subject having cancer may be reduced compared to the level of DNA in a sample from a subject not having cancer, e.g., a gene deletion in a cancer patient's sample.
DNA expression data, in some embodiments, refers to data (e.g., sequencing data) for DNA (e.g., coding or non-coding genomic DNA) present in a sample, for example, sequencing data for a gene that is present in a patient's sample. DNA that is present in a sample may or may not be transcribed, but it may be sequenced using DNA sequencing platforms. Such data may be useful, in some embodiments, to determine whether the patient has one or more mutations associated with a particular cancer.
RNA expression data may be acquired using any method known in the art including, but not limited to: whole transcriptome sequencing, total RNA sequencing, mRNA sequencing, targeted RNA sequencing, small RNA sequencing, ribosome profiling, RNA exome capture sequencing, and/or deep RNA sequencing. DNA expression data may be acquired using any method known in the art including any known method of DNA sequencing. For example, DNA sequencing may be used to identify one or more mutations in the DNA of a subject. Any technique used in the art to sequence DNA may be used with the methods and compositions described herein. As a set of non-limiting examples, the DNA may be sequenced through single-molecule real-time sequencing, ion torrent sequencing, pyrosequencing, sequencing by synthesis, sequencing by ligation (SOLiD sequencing), nanopore sequencing, or Sanger sequencing (chain termination sequencing). Protein expression data may be acquired using any method known in the art including, but not limited to: N-terminal amino acid analysis, C-terminal amino acid analysis, Edman degradation (including through use of a machine such as a protein sequenator), or mass spectrometry.
In some embodiments, the expression data is acquired through bulk RNA sequencing. Bulk RNA sequencing may include obtaining expression levels for each gene across RNA extracted from a large population of input cells (e.g., a mixture of different cell types). In some embodiments, the expression data is acquired through single cell sequencing (e.g., scRNA-seq). Single cell sequencing may include sequencing individual cells.
In some embodiments, the expression data comprises whole exome sequencing (WES) data. In some embodiments, the expression data comprises whole genome sequencing (WGS) data. In some embodiments, the expression data comprises next-generation sequencing (NGS) data. In some embodiments, the expression data comprises microarray data.
Obtaining Expression Data
In some embodiments, a method to process expression data (e.g., data obtained from sequencing) comprises obtaining expression data for a subject (e.g., a subject who has or has been diagnosed with a cancer). In some embodiments, obtaining expression data comprises obtaining a biological sample and processing it to perform sequencing using any one of the sequencing methods described herein. In some embodiments, expression data is obtained from a lab or center that has performed experiments to obtain expression data (e.g., a lab or center that has performed sequencing). In some embodiments, a lab or center is a medical lab or center.
In some embodiments, expression data is obtained by obtaining a computer storage medium (e.g., a data storage drive) on which the data exists. In some embodiments, expression data is obtained via a secured server (e.g., a SFTP server, or Illumina BaseSpace). In some embodiments, data is obtained in the form of a text-based file (e.g., a FASTQ file). In some embodiments, a file in which sequencing data is stored also contains quality scores of the sequencing data. In some embodiments, a file in which sequencing data is stored also contains sequence identifier information.
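As a non-limiting illustration of how a text-based file may carry sequence identifier information and quality scores alongside the sequencing data, the following Python sketch reads records from a four-line-per-record FASTQ file. The names FastqRecord and read_fastq, the Phred+33 quality encoding, and the example reads are assumptions made for illustration only and are not part of the present disclosure.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical container used only for this sketch)."""
    identifier: str        # text after the leading '@' on the header line
    sequence: str          # nucleotide sequence
    qualities: List[int]   # per-base quality scores decoded from ASCII


def read_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Assumes the common Phred+33 encoding unless another offset is given.
    """
    while True:
        header = handle.readline()
        if not header:                 # end of file
            return
        header = header.rstrip("\r\n")
        if not header:                 # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        scores = [ord(ch) - phred_offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


if __name__ == "__main__":
    # Two made-up reads, used only to exercise the parser.
    example = "@read1 sample=A\nACGT\n+\nIIII\n@read2\nGG\n+\n!5\n"
    for record in read_fastq(io.StringIO(example)):
        print(record.identifier, record.sequence, record.qualities)
```

In practice an established parser may be preferable; the sketch is intended only to show which fields of such a file hold the identifier, the sequence, and the quality scores.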
Expression Levels
Expression data, in some embodiments, includes gene expression levels. Gene expression levels may be detected by detecting a product of gene expression such as mRNA and/or protein. In some embodiments, gene expression levels are determined by detecting a level of a mRNA in a sample. As used herein, the terms “determining” or “detecting” may include assessing the presence, absence, quantity and/or amount (which can be an effective amount) of a substance within a sample, including the derivation of qualitative or quantitative concentration levels of such substances, or otherwise evaluating the values and/or categorization of such substances in a sample from a subject.
Process 2300 begins at act 2302, where bulk sequencing data is obtained from a biological sample obtained from a subject. The bulk sequencing data is obtained by any suitable method, for example, using any of the methods described herein including at least with respect to
In some embodiments, the bulk sequencing data obtained at act 2302 comprises RNA-seq data. In some embodiments, the biological sample comprises blood or tissue. In some embodiments, the biological sample comprises one or more tumor cells and one or more TME cells.
Next, process 2300 proceeds to act 2304, where the sequencing data obtained at act 2302 is normalized to transcripts per million (TPM) units. The normalization may be performed using any suitable software and in any suitable way. For example, in some embodiments, TPM normalization may be performed according to the techniques described in Wagner et al. (Theory Biosci. (2012) 131:281-285), which is incorporated by reference herein in its entirety. In some embodiments, the TPM normalization may be performed using a software package, such as, for example, the gcrma package. Aspects of the gcrma package are described in Wu J, Gentry RIwcfJMJ (2021). “gcrma: Background Adjustment Using Sequence Information. R package version 2.66.0.”, which is incorporated by reference in its entirety herein. In some embodiments, RNA expression level in TPM units for a particular gene may be calculated according to the following formula:
Next, process 2300 proceeds to act 2306, where the expression levels in TPM units (as determined at act 2304) may be log transformed, although in some embodiments the log transformation is optional and may be omitted.
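By way of a non-limiting illustration, one commonly used definition of TPM, consistent with the approach of Wagner et al., is TPM_i = 10^6 × (c_i / l_i) / Σ_j (c_j / l_j), where c_i is the number of reads mapped to gene i, l_i is the length of gene i in kilobases, and the sum runs over all genes j in the sample. The following Python sketch implements that definition. The function name, the argument names, and the example counts are hypothetical and are not taken from the present disclosure.

```python
from typing import Dict


def tpm(read_counts: Dict[str, float], lengths_kb: Dict[str, float]) -> Dict[str, float]:
    """Convert per-gene read counts to transcripts per million (TPM).

    read_counts: reads mapped to each gene (hypothetical input format).
    lengths_kb:  length of each gene in kilobases.
    """
    # Step 1: length-normalize each gene's count (reads per kilobase).
    rates = {}
    for gene, count in read_counts.items():
        length = lengths_kb[gene]
        if length <= 0:
            raise ValueError("gene length must be positive: " + gene)
        rates[gene] = count / length

    # Step 2: rescale so that the values sum to one million.
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads were mapped to any gene")
    return {gene: 1e6 * rate / total for gene, rate in rates.items()}


if __name__ == "__main__":
    # Made-up counts and lengths, used only to exercise the function.
    counts = {"GENE_A": 10, "GENE_B": 20, "GENE_C": 30}
    lengths = {"GENE_A": 1.0, "GENE_B": 2.0, "GENE_C": 3.0}
    print(tpm(counts, lengths))
```

Because the length normalization is applied before the rescaling, the TPM values of a sample sum to one million regardless of sequencing depth, which is one reason this unit is convenient for comparing samples.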
Process 2300 is illustrative and there are variations. For example, in some embodiments, one or both of acts 2304 and 2306 may be omitted. Thus, in some embodiments, the expression levels may not be normalized to transcripts per million units and may, instead, be converted to another type of unit (e.g., reads per kilobase million (RPKM) or fragments per kilobase million (FPKM) or any other suitable unit). Additionally or alternatively, in some embodiments, the log transformation may be omitted. Instead, no transformation may be applied in some embodiments, or one or more other transformations may be applied in lieu of the log transformation.
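The optional log transformation of act 2306 and the alternative RPKM unit mentioned above may be sketched in Python as follows. The base-2 logarithm and the pseudocount of 1 are assumptions chosen for illustration, since the disclosure does not fix either, and the function names and example values are hypothetical.

```python
import math
from typing import Dict


def log_transform(values: Dict[str, float], pseudocount: float = 1.0) -> Dict[str, float]:
    """Return log2(value + pseudocount) for each gene.

    The pseudocount keeps genes with zero expression finite; both the
    base and the pseudocount are illustrative assumptions.
    """
    return {gene: math.log2(v + pseudocount) for gene, v in values.items()}


def rpkm(read_counts: Dict[str, float], lengths_kb: Dict[str, float]) -> Dict[str, float]:
    """Reads per kilobase of gene length per million mapped reads."""
    total_millions = sum(read_counts.values()) / 1e6
    if total_millions == 0:
        raise ValueError("no reads were mapped to any gene")
    return {
        gene: count / (lengths_kb[gene] * total_millions)
        for gene, count in read_counts.items()
    }


if __name__ == "__main__":
    # Made-up values, used only to exercise the functions.
    print(log_transform({"GENE_A": 0.0, "GENE_B": 3.0}))
    print(rpkm({"GENE_A": 500_000, "GENE_B": 1_500_000},
               {"GENE_A": 2.0, "GENE_B": 3.0}))
```

Unlike TPM, RPKM divides by the total number of mapped reads before accounting for gene length, so RPKM values need not sum to the same total in every sample.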
Expression data obtained by process 2300 can include the sequence data generated by a sequencing protocol (e.g., the series of nucleotides in a nucleic acid molecule identified by next-generation sequencing, Sanger sequencing, etc.) as well as information contained therein (e.g., information indicative of source, tissue type, etc.) which may also be considered information that can be inferred or determined from the sequence data. In some embodiments, expression data obtained by process 2300 can include information included in a FASTA file, a description and/or quality scores included in a FASTQ file, an aligned position included in a BAM file, and/or any other suitable information obtained from any suitable file.
Methods of Treatment
In certain methods described herein, an effective amount of anti-cancer therapy described herein may be administered or recommended for administration to a subject (e.g., a human) in need of the treatment via a suitable route (e.g., intravenous administration).
The subject to be treated by the methods described herein may be a human patient having, suspected of having, or at risk for a cancer. Examples of a cancer include, but are not limited to, melanoma, lung cancer, brain cancer, breast cancer, colorectal cancer, pancreatic cancer, liver cancer, prostate cancer, skin cancer, kidney cancer, and bladder cancer. At the time of diagnosis, the cancer may be cancer of unknown primary. The subject to be treated by the methods described herein may be a mammal (e.g., may be a human). Mammals include but are not limited to: farm animals (e.g., livestock), sport animals, laboratory animals, pets, primates, horses, dogs, cats, mice, and rats.
A subject having a cancer may be identified by routine medical examination, e.g., laboratory tests, biopsy, PET scans, CT scans, or ultrasounds. A subject suspected of having a cancer might show one or more symptoms of the disorder, e.g., unexplained weight loss, fever, fatigue, cough, pain, skin changes, unusual bleeding or discharge, and/or thickening or lumps in parts of the body. A subject at risk for a cancer may be a subject having one or more of the risk factors for that disorder. For example, risk factors associated with cancer include, but are not limited to, (a) viral infection (e.g., herpes virus infection), (b) age, (c) family history, (d) heavy alcohol consumption, (e) obesity, and (f) tobacco use.
An “effective amount” as used herein refers to the amount of each active agent required to confer therapeutic effect on the subject, either alone or in combination with one or more other active agents. Effective amounts vary, as recognized by those skilled in the art, depending on the particular condition being treated, the severity of the condition, the individual patient parameters including age, physical condition, size, gender and weight, the duration of the treatment, the nature of concurrent therapy (if any), the specific route of administration and like factors within the knowledge and expertise of the health practitioner. These factors are well known to those of ordinary skill in the art and can be addressed with no more than routine experimentation. It is generally preferred that a maximum dose of the individual components or combinations thereof be used, that is, the highest safe dose according to sound medical judgment. It will be understood by those of ordinary skill in the art, however, that a patient may insist upon a lower dose or tolerable dose for medical reasons, psychological reasons, or for virtually any other reasons.
Empirical considerations, such as the half-life of a therapeutic compound, generally contribute to the determination of the dosage. For example, antibodies that are compatible with the human immune system, such as humanized antibodies or fully human antibodies, may be used to prolong half-life of the antibody and to prevent the antibody being attacked by the host's immune system. Frequency of administration may be determined and adjusted over the course of therapy and is generally (but not necessarily) based on treatment, and/or suppression, and/or amelioration, and/or delay of a cancer. Alternatively, sustained continuous release formulations of an anti-cancer therapeutic agent may be appropriate. Various formulations and devices for achieving sustained release are known in the art.
In some embodiments, dosages for an anti-cancer therapeutic agent as described herein may be determined empirically in individuals who have been administered one or more doses of the anti-cancer therapeutic agent. Individuals may be administered incremental dosages of the anti-cancer therapeutic agent. To assess efficacy of an administered anti-cancer therapeutic agent, one or more aspects of a cancer (e.g., tumor formation, tumor growth, molecular category identified for the cancer using the techniques described herein) may be analyzed.
Generally, for administration of any of the anti-cancer antibodies described herein, an initial candidate dosage may be about 2 mg/kg. For the purpose of the present disclosure, a typical daily dosage might range from about any of 0.1 μg/kg to 3 μg/kg to 30 μg/kg to 300 μg/kg to 3 mg/kg, to 30 mg/kg to 100 mg/kg or more, depending on the factors mentioned above. For repeated administrations over several days or longer, depending on the condition, the treatment is sustained until a desired suppression or amelioration of symptoms occurs or until sufficient therapeutic levels are achieved to alleviate a cancer, or one or more symptoms thereof. An exemplary dosing regimen comprises administering an initial dose of about 2 mg/kg, followed by a weekly maintenance dose of about 1 mg/kg of the antibody, or followed by a maintenance dose of about 1 mg/kg every other week. However, other dosage regimens may be useful, depending on the pattern of pharmacokinetic decay that the practitioner (e.g., a medical doctor) wishes to achieve. For example, dosing from one to four times a week is contemplated. In some embodiments, dosing ranging from about 3 μg/kg to about 2 mg/kg (such as about 3 μg/kg, about 10 μg/kg, about 30 μg/kg, about 100 μg/kg, about 300 μg/kg, about 1 mg/kg, and about 2 mg/kg) may be used. In some embodiments, dosing frequency is once every week, every 2 weeks, every 4 weeks, every 5 weeks, every 6 weeks, every 7 weeks, every 8 weeks, every 9 weeks, or every 10 weeks; or once every month, every 2 months, or every 3 months, or longer. The progress of this therapy may be monitored by conventional techniques and assays. The dosing regimen (including the therapeutic used) may vary over time.
When the anti-cancer therapeutic agent is not an antibody, it may be administered at the rate of about 0.1 to 300 mg/kg of the weight of the patient divided into one to three doses, or as disclosed herein. In some embodiments, for an adult patient of normal weight, doses ranging from about 0.3 to 5.00 mg/kg may be administered. The particular dosage regimen, e.g., dose, timing, and/or repetition, will depend on the particular subject and that individual's medical history, as well as the properties of the individual agents (such as the half-life of the agent, and other considerations well known in the art).
For the purpose of the present disclosure, the appropriate dosage of an anti-cancer therapeutic agent will depend on the specific anti-cancer therapeutic agent(s) (or compositions thereof) employed, the type and severity of cancer, whether the anti-cancer therapeutic agent is administered for preventive or therapeutic purposes, previous therapy, the patient's clinical history and response to the anti-cancer therapeutic agent, and the discretion of the attending physician. Typically, the clinician will administer an anti-cancer therapeutic agent, such as an antibody, until a dosage is reached that achieves the desired result.
Administration of an anti-cancer therapeutic agent can be continuous or intermittent, depending, for example, upon the recipient's physiological condition, whether the purpose of the administration is therapeutic or prophylactic, and other factors known to skilled practitioners. The administration of an anti-cancer therapeutic agent (e.g., an anti-cancer antibody) may be essentially continuous over a preselected period of time or may be in a series of spaced doses, e.g., either before, during, or after developing cancer.
As used herein, the term “treating” refers to the application or administration of a composition including one or more active agents to a subject, who has a cancer, a symptom of a cancer, or a predisposition toward a cancer, with the purpose to cure, heal, alleviate, relieve, alter, remedy, ameliorate, improve, or affect the cancer or one or more symptoms of the cancer, or the predisposition toward a cancer.
Alleviating a cancer includes delaying the development or progression of the disease or reducing disease severity. Alleviating the disease does not necessarily require curative results. As used herein, “delaying” the development of a disease (e.g., a cancer) means to defer, hinder, slow, retard, stabilize, and/or postpone progression of the disease. This delay can be of varying lengths of time, depending on the history of the disease and/or individuals being treated. A method that “delays” or alleviates the development of a disease, or delays the onset of the disease, is a method that reduces probability of developing one or more symptoms of the disease in a given period and/or reduces extent of the symptoms in a given time frame, when compared to not using the method. Such comparisons are typically based on clinical studies, using a number of subjects sufficient to give a statistically significant result.
“Development” or “progression” of a disease means initial manifestations and/or ensuing progression of the disease. Development of the disease can be detected and assessed using clinical techniques known in the art. However, development also refers to progression that may be undetectable. For purpose of this disclosure, development or progression refers to the biological course of the symptoms. “Development” includes occurrence, recurrence, and onset. As used herein “onset” or “occurrence” of a cancer includes initial onset and/or recurrence.
In some embodiments, the anti-cancer therapeutic agent (e.g., an antibody) described herein is administered to a subject in need of the treatment at an amount sufficient to reduce cancer (e.g., tumor) growth by at least 10% (e.g., 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or greater). In some embodiments, the anti-cancer therapeutic agent (e.g., an antibody) described herein is administered to a subject in need of the treatment at an amount sufficient to reduce cancer cell number or tumor size by at least 10% (e.g., 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or more). In other embodiments, the anti-cancer therapeutic agent is administered in an amount effective in altering cancer type. Alternatively, the anti-cancer therapeutic agent is administered in an amount effective in reducing tumor formation or metastasis.
Conventional methods, known to those of ordinary skill in the art of medicine, may be used to administer the anti-cancer therapeutic agent to the subject, depending upon the type of disease to be treated or the site of the disease. The anti-cancer therapeutic agent can also be administered via other conventional routes, e.g., administered orally, parenterally, by inhalation spray, topically, rectally, nasally, buccally, vaginally or via an implanted reservoir. The term “parenteral” as used herein includes subcutaneous, intracutaneous, intravenous, intramuscular, intraarticular, intraarterial, intrasynovial, intrasternal, intrathecal, intralesional, and intracranial injection or infusion techniques. In addition, an anti-cancer therapeutic agent may be administered to the subject via injectable depot routes of administration such as using 1-, 3-, or 6-month depot injectable or biodegradable materials and methods.
Injectable compositions may contain various carriers such as vegetable oils, dimethylacetamide, dimethylformamide, ethyl lactate, ethyl carbonate, isopropyl myristate, ethanol, and polyols (e.g., glycerol, propylene glycol, liquid polyethylene glycol, and the like). For intravenous injection, water-soluble anti-cancer therapeutic agents can be administered by the drip method, whereby a pharmaceutical formulation containing the antibody and a physiologically acceptable excipient is infused. Physiologically acceptable excipients may include, for example, 5% dextrose, 0.9% saline, Ringer's solution, and/or other suitable excipients. Intramuscular preparations, e.g., a sterile formulation of a suitable soluble salt form of the anti-cancer therapeutic agent, can be dissolved and administered in a pharmaceutical excipient such as Water-for-Injection, 0.9% saline, and/or 5% glucose solution.
In one embodiment, an anti-cancer therapeutic agent is administered via site-specific or targeted local delivery techniques. Examples of site-specific or targeted local delivery techniques include various implantable depot sources of the agent or local delivery catheters, such as infusion catheters, an indwelling catheter, or a needle catheter, synthetic grafts, adventitial wraps, shunts and stents or other implantable devices, site specific carriers, direct injection, or direct application. See, e.g., PCT Publication No. WO 00/53211 and U.S. Pat. No. 5,981,568, the contents of each of which are incorporated by reference herein for this purpose.
Targeted delivery of therapeutic compositions containing an antisense polynucleotide, expression vector, or subgenomic polynucleotides can also be used. Receptor-mediated DNA delivery techniques are described in, for example, Findeis et al., Trends Biotechnol. (1993) 11:202; Chiou et al., Gene Therapeutics: Methods and Applications Of Direct Gene Transfer (J. A. Wolff, ed.) (1994); Wu et al., J. Biol. Chem. (1988) 263:621; Wu et al., J. Biol. Chem. (1994) 269:542; Zenke et al., Proc. Natl. Acad. Sci. USA (1990) 87:3655; Wu et al., J. Biol. Chem. (1991) 266:338. The contents of each of the foregoing are incorporated by reference herein for this purpose.
Therapeutic compositions containing a polynucleotide may be administered in a range of about 100 ng to about 200 mg of DNA for local administration in a gene therapy protocol. In some embodiments, concentration ranges of about 500 ng to about 50 mg, about 1 μg to about 2 mg, about 5 μg to about 500 μg, and about 20 μg to about 100 μg of DNA or more can also be used during a gene therapy protocol.
Therapeutic polynucleotides and polypeptides can be delivered using gene delivery vehicles. The gene delivery vehicle can be of viral or non-viral origin (e.g., Jolly, Cancer Gene Therapy (1994) 1:51; Kimura, Human Gene Therapy (1994) 5:845; Connelly, Human Gene Therapy (1995) 1:185; and Kaplitt, Nature Genetics (1994) 6:148). The contents of each of the foregoing are incorporated by reference herein for this purpose. Expression of such coding sequences can be induced using endogenous mammalian or heterologous promoters and/or enhancers. Expression of the coding sequence can be either constitutive or regulated.
Viral-based vectors for delivery of a desired polynucleotide and expression in a desired cell are well known in the art. Exemplary viral-based vehicles include, but are not limited to, recombinant retroviruses (see, e.g., PCT Publication Nos. WO 90/07936; WO 94/03622; WO 93/25698; WO 93/25234; WO 93/11230; WO 93/10218; WO 91/02805; U.S. Pat. Nos. 5,219,740 and 4,777,127; GB Patent No. 2,200,651; and EP Patent No. 0 345 242), alphavirus-based vectors (e.g., Sindbis virus vectors, Semliki forest virus (ATCC VR-67; ATCC VR-1247), Ross River virus (ATCC VR-373; ATCC VR-1246) and Venezuelan equine encephalitis virus (ATCC VR-923; ATCC VR-1250; ATCC VR 1249; ATCC VR-532)), and adeno-associated virus (AAV) vectors (see, e.g., PCT Publication Nos. WO 94/12649, WO 93/03769; WO 93/19191; WO 94/28938; WO 95/11984 and WO 95/00655). Administration of DNA linked to killed adenovirus as described in Curiel, Hum. Gene Ther. (1992) 3:147 can also be employed. The contents of each of the foregoing are incorporated by reference herein for this purpose.
Non-viral delivery vehicles and methods can also be employed, including, but not limited to, polycationic condensed DNA linked or unlinked to killed adenovirus alone (see, e.g., Curiel, Hum. Gene Ther. (1992) 3:147); ligand-linked DNA (see, e.g., Wu, J. Biol. Chem. (1989) 264:16985); eukaryotic cell delivery vehicle cells (see, e.g., U.S. Pat. No. 5,814,482; PCT Publication Nos. WO 95/07994; WO 96/17072; WO 95/30763; and WO 97/42338); and nucleic charge neutralization or fusion with cell membranes. Naked DNA can also be employed. Exemplary naked DNA introduction methods are described in PCT Publication No. WO 90/11092 and U.S. Pat. No. 5,580,859. Liposomes that can act as gene delivery vehicles are described in U.S. Pat. No. 5,422,120; PCT Publication Nos. WO 95/13796; WO 94/23697; WO 91/14445; and EP Patent No. 0524968. Additional approaches are described in Philip, Mol. Cell. Biol. (1994) 14:2411, and in Woffendin, Proc. Natl. Acad. Sci. (1994) 91:1581. The contents of each of the foregoing are incorporated by reference herein for this purpose.
It is also apparent that an expression vector can be used to direct expression of any of the protein-based anti-cancer therapeutic agents (e.g., anti-cancer antibody). For example, peptide inhibitors that are capable of blocking (from partial to complete blocking) a cancer-causing biological activity are known in the art.
In some embodiments, more than one anti-cancer therapeutic agent, such as an antibody and a small molecule inhibitory compound, may be administered to a subject in need of the treatment. The agents may be of the same type or different types from each other. At least one, at least two, at least three, at least four, or at least five different agents may be co-administered. Generally, anti-cancer agents for administration have complementary activities that do not adversely affect each other. Anti-cancer therapeutic agents may also be used in conjunction with other agents that serve to enhance and/or complement the effectiveness of the agents.
Treatment efficacy can be assessed by methods well-known in the art, e.g., monitoring tumor growth or formation in a patient subjected to the treatment. Alternatively or in addition, treatment efficacy can be assessed by monitoring tumor type over the course of treatment (e.g., before, during, and after treatment).
A subject having cancer may be treated using any combination of anti-cancer therapeutic agents or one or more anti-cancer therapeutic agents and one or more additional therapies (e.g., surgery and/or radiotherapy). The term combination therapy, as used herein, embraces administration of more than one treatment (e.g., an antibody and a small molecule or an antibody and radiotherapy) in a sequential manner, that is, wherein each therapeutic agent is administered at a different time, as well as administration of these therapeutic agents, or at least two of the agents or therapies, in a substantially simultaneous manner.
Sequential or substantially simultaneous administration of each agent or therapy can be effected by any appropriate route including, but not limited to, oral routes, intravenous routes, intramuscular routes, subcutaneous routes, and direct absorption through mucous membrane tissues. The agents or therapies can be administered by the same route or by different routes. For example, a first agent (e.g., a small molecule) can be administered orally, and a second agent (e.g., an antibody) can be administered intravenously.
As used herein, the term “sequential” means, unless otherwise specified, characterized by a regular sequence or order, e.g., if a dosage regimen includes the administration of an antibody and a small molecule, a sequential dosage regimen could include administration of the antibody before, simultaneously, substantially simultaneously, or after administration of the small molecule, but both agents will be administered in a regular sequence or order. The term “separate” means, unless otherwise specified, to keep apart one from the other. The term “simultaneously” means, unless otherwise specified, happening or done at the same time, i.e., the agents are administered at the same time. The term “substantially simultaneously” means that the agents are administered within minutes of each other (e.g., within 10 minutes of each other) and intends to embrace joint administration as well as consecutive administration, but if the administration is consecutive it is separated in time for only a short period (e.g., the time it would take a medical practitioner to administer two agents separately). As used herein, concurrent administration and substantially simultaneous administration are used interchangeably. Sequential administration refers to temporally separated administration of the agents or therapies described herein.
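By way of non-limiting illustration only, the timing terms defined above can be sketched in Python. The function name `classify_administration`, its return labels, and the default 10-minute window (borrowed from the "e.g., within 10 minutes" example above) are assumptions introduced here for illustration; the sketch is not a dosing tool and does not narrow any definition given herein.

```python
from datetime import datetime, timedelta


def classify_administration(
    first: datetime,
    second: datetime,
    window: timedelta = timedelta(minutes=10),  # example window; an assumption
) -> str:
    """Label the relative timing of two administrations (hypothetical helper)."""
    gap = abs(second - first)
    if gap == timedelta(0):
        # Administered at the same time; this case also falls within
        # "substantially simultaneous" as that term is used above.
        return "simultaneous"
    if gap <= window:
        # Joint or consecutive administration separated by only a short period.
        return "substantially simultaneous"
    # Temporally separated administration.
    return "sequential"
```

For instance, two administrations five minutes apart would be labeled "substantially simultaneous" under the assumed default window, while administrations two hours apart would be labeled "sequential".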
Combination therapy can also embrace the administration of the anti-cancer therapeutic agent (e.g., an antibody) in further combination with other biologically active ingredients (e.g., a vitamin) and non-drug therapies (e.g., surgery or radiotherapy).
It should be appreciated that any combination of anti-cancer therapeutic agents may be used in any sequence for treating a cancer. The combinations described herein may be selected on the basis of a number of factors, which include but are not limited to reducing tumor formation or tumor growth, and/or alleviating at least one symptom associated with the cancer, or the effectiveness for mitigating the side effects of another agent of the combination. For example, a combined therapy as provided herein may reduce any of the side effects associated with the individual members of the combination, for example, a side effect associated with an administered anti-cancer agent.
In some embodiments, an anti-cancer therapeutic agent is an antibody, an immunotherapy, a radiation therapy, a surgical therapy, and/or a chemotherapy.
Examples of the antibody anti-cancer agents include, but are not limited to, alemtuzumab (Campath), trastuzumab (Herceptin), Ibritumomab tiuxetan (Zevalin), Brentuximab vedotin (Adcetris), Ado-trastuzumab emtansine (Kadcyla), blinatumomab (Blincyto), Bevacizumab (Avastin), Cetuximab (Erbitux), ipilimumab (Yervoy), nivolumab (Opdivo), pembrolizumab (Keytruda), atezolizumab (Tecentriq), avelumab (Bavencio), durvalumab (Imfinzi), and panitumumab (Vectibix).
Examples of an immunotherapy include, but are not limited to, a PD-1 inhibitor or a PD-L1 inhibitor, a CTLA-4 inhibitor, adoptive cell transfer, therapeutic cancer vaccines, oncolytic virus therapy, T-cell therapy, and immune checkpoint inhibitors.
Examples of radiation therapy include, but are not limited to, ionizing radiation, gamma-radiation, neutron beam radiotherapy, electron beam radiotherapy, proton therapy, brachytherapy, systemic radioactive isotopes, and radiosensitizers.
Examples of a surgical therapy include, but are not limited to, a curative surgery (e.g., tumor removal surgery), a preventive surgery, a laparoscopic surgery, and a laser surgery.
Examples of the chemotherapeutic agents include, but are not limited to, Carboplatin or Cisplatin, Docetaxel, Gemcitabine, Nab-Paclitaxel, Paclitaxel, Pemetrexed, and Vinorelbine.
Additional examples of chemotherapy include, but are not limited to, Platinating agents, such as Carboplatin, Oxaliplatin, Cisplatin, Nedaplatin, Satraplatin, Lobaplatin, Triplatin Tetranitrate, Picoplatin, Prolindac, Aroplatin and other derivatives; Topoisomerase I inhibitors, such as Camptothecin, Topotecan, irinotecan/SN38, rubitecan, Belotecan, and other derivatives; Topoisomerase II inhibitors, such as Etoposide (VP-16), Daunorubicin, a doxorubicin agent (e.g., doxorubicin, doxorubicin hydrochloride, doxorubicin analogs, or doxorubicin and salts or analogs thereof in liposomes), Mitoxantrone, Aclarubicin, Epirubicin, Idarubicin, Amrubicin, Amsacrine, Pirarubicin, Valrubicin, Zorubicin, Teniposide and other derivatives; Antimetabolites, such as Folic family (Methotrexate, Pemetrexed, Raltitrexed, Aminopterin, and relatives or derivatives thereof); Purine antagonists (Thioguanine, Fludarabine, Cladribine, 6-Mercaptopurine, Pentostatin, clofarabine, and relatives or derivatives thereof) and Pyrimidine antagonists (Cytarabine, Floxuridine, Azacitidine, Tegafur, Carmofur, Capecitabine, Gemcitabine, hydroxyurea, 5-Fluorouracil (5FU), and relatives or derivatives thereof); Alkylating agents, such as Nitrogen mustards (e.g., Cyclophosphamide, Melphalan, Chlorambucil, mechlorethamine, Ifosfamide, Trofosfamide, Prednimustine, Bendamustine, Uramustine, Estramustine, and relatives or derivatives thereof); nitrosoureas (e.g., Carmustine, Lomustine, Semustine, Fotemustine, Nimustine, Ranimustine, Streptozocin, and relatives or derivatives thereof); Triazenes (e.g., Dacarbazine, Altretamine, Temozolomide, and relatives or derivatives thereof); Alkyl sulphonates (e.g., Busulfan, Mannosulfan, Treosulfan, and relatives or derivatives thereof); Procarbazine; Mitobronitol, and Aziridines (e.g., Carboquone, Triaziquone, ThioTEPA, triethylenemelamine, and relatives or derivatives thereof); Antibiotics, such as Hydroxyurea, Anthracyclines (e.g., doxorubicin agent, daunorubicin, epirubicin and relatives or derivatives thereof); Anthracenediones (e.g., Mitoxantrone and relatives or derivatives thereof); Streptomyces family antibiotics (e.g., Bleomycin, Mitomycin C, Actinomycin, and Plicamycin); and ultraviolet light.
Computer Implementation
An illustrative implementation of a computer system 2400 that may be used in connection with any of the embodiments of the technology described herein (e.g., such as the methods of
Computing device 2400 may also include a network input/output (I/O) interface 2440 via which the computing device may communicate with other computing devices (e.g., over a network), and may also include one or more user I/O interfaces 2450, via which the computing device may provide output to and receive input from a user. The user I/O interfaces may include devices such as a keyboard, a mouse, a microphone, a display device (e.g., a monitor or touch screen), speakers, a camera, and/or various other types of I/O devices.
The above-described embodiments can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can be executed on any suitable processor (e.g., a microprocessor) or collection of processors, whether provided in a single computing device or distributed among multiple computing devices. It should be appreciated that any component or collection of components that perform the functions described above can be generically considered as one or more controllers that control the above-described functions. The one or more controllers can be implemented in numerous ways, such as with dedicated hardware, or with general purpose hardware (e.g., one or more processors) that is programmed using microcode or software to perform the functions recited above.
In this respect, it should be appreciated that one implementation of the embodiments described herein comprises at least one computer-readable storage medium (e.g., RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other tangible, non-transitory computer-readable storage medium) encoded with a computer program (i.e., a plurality of executable instructions) that, when executed on one or more processors, performs the above-described functions of one or more embodiments. The computer-readable medium may be transportable such that the program stored thereon can be loaded onto any computing device to implement aspects of the techniques described herein. In addition, it should be appreciated that the reference to a computer program which, when executed, performs any of the above-described functions, is not limited to an application program running on a host computer. Rather, the terms computer program and software are used herein in a generic sense to reference any type of computer code (e.g., application software, firmware, microcode, or any other form of computer instruction) that can be employed to program one or more processors to implement aspects of the techniques described herein.
The foregoing description of implementations provides illustration and description but is not intended to be exhaustive or to limit the implementations to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the implementations. In other implementations the methods depicted in these figures may include fewer operations, different operations, differently ordered operations, and/or additional operations. Further, non-dependent blocks may be performed in parallel.
It will be apparent that example aspects, as described above, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. Further, certain portions of the implementations may be implemented as a “module” that performs one or more functions. This module may include hardware, such as a processor, an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA), or a combination of hardware and software.
Having thus described several aspects and embodiments of the technology set forth in the disclosure, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be within the spirit and scope of the technology described herein. For example, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the embodiments described herein. Those skilled in the art will recognize or be able to ascertain using no more than routine experimentation many equivalents to the specific embodiments described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, inventive embodiments may be practiced otherwise than as specifically described. In addition, any combination of two or more features, systems, articles, materials, kits, and/or methods described herein, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure.
The above-described embodiments can be implemented in any of numerous ways. One or more aspects and embodiments of the present disclosure involving the performance of processes or methods may utilize program instructions executable by a device (e.g., a computer, a processor, or other device) to perform, or control performance of, the processes or methods. In this respect, various inventive concepts may be embodied as a computer readable storage medium (or multiple computer readable storage media) (e.g., a computer memory, one or more floppy discs, compact discs, optical discs, magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement one or more of the various embodiments described above. The computer readable medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various ones of the aspects described above. In some embodiments, computer readable media may be non-transitory media.
The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects as described above. Additionally, it should be appreciated that according to one aspect, one or more computer programs that when executed perform methods of the present disclosure need not reside on a single computer or processor, but may be distributed in a modular fashion among a number of different computers or processors to implement various aspects of the present disclosure.
Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically the functionality of the program modules may be combined or distributed as desired in various embodiments.
Also, data structures may be stored in computer-readable media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that convey relationship between the fields. However, any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationship between data elements.
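As a minimal, non-limiting sketch of the preceding paragraph, the same relationship between a gene and its expression level may be conveyed by location, by a tag, or by a reference. All names below (`gene_names`, `expression_by_gene`, `Gene`, `ExpressionRecord`, and the lookup helpers) and all values are hypothetical placeholders chosen for illustration, not data from any embodiment.

```python
from dataclasses import dataclass

# (1) Relationship by location: index i in both lists refers to the same gene.
gene_names = ["GENE_A", "GENE_B", "GENE_C"]  # placeholder gene labels
expression_levels = [12.5, 3.0, 0.0]         # placeholder values

# (2) Relationship by tag: each value is keyed by an explicit gene label.
expression_by_gene = dict(zip(gene_names, expression_levels))


# (3) Relationship by reference: each record points at a Gene object.
@dataclass(frozen=True)
class Gene:
    name: str


@dataclass
class ExpressionRecord:
    gene: Gene
    level: float


records = [
    ExpressionRecord(Gene(name), level)
    for name, level in zip(gene_names, expression_levels)
]


def level_by_location(name: str) -> float:
    """Look up a level through the shared list position."""
    return expression_levels[gene_names.index(name)]


def level_by_tag(name: str) -> float:
    """Look up a level through its explicit key."""
    return expression_by_gene[name]


def level_by_reference(name: str) -> float:
    """Look up a level by following each record's reference to its Gene."""
    return next(r.level for r in records if r.gene.name == name)
```

All three lookups return the same value for a given gene; only the mechanism that ties the fields together differs.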
When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.
Also, a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible formats.
Such computers may be interconnected by one or more networks in any suitable form, including a local area network or a wide area network, such as an enterprise network, an intelligent network (IN), or the Internet. Such networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
Also, as described, some aspects may be embodied as one or more methods. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
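As one hedged illustration of acts being performed in a different order or simultaneously, the following Python sketch runs two independent acts either one after the other or concurrently and obtains the same results. The acts `act_a` and `act_b` are hypothetical stand-ins for any two acts that do not depend on each other's output; they are not acts of any particular method described herein.

```python
from concurrent.futures import ThreadPoolExecutor


def act_a(x: int) -> int:
    """Hypothetical act that does not depend on act_b."""
    return x + 1


def act_b(x: int) -> int:
    """Hypothetical act that does not depend on act_a."""
    return x * 2


def run_sequential(x: int) -> tuple:
    # Acts performed one after the other, in the illustrated order.
    return act_a(x), act_b(x)


def run_reordered(x: int) -> tuple:
    # Same acts, performed in the opposite order.
    b = act_b(x)
    a = act_a(x)
    return a, b


def run_concurrent(x: int) -> tuple:
    # Same acts, submitted to a thread pool so they may overlap in time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(act_a, x)
        future_b = pool.submit(act_b, x)
        return future_a.result(), future_b.result()
```

Because neither act consumes the other's output, all three orderings yield identical results; acts with data dependencies would need to preserve the order of those dependencies.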
All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.
The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”
The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
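The readings of “and/or” and “at least one” given above both amount to an inclusive OR over the listed elements, with unlisted elements optionally present. A minimal sketch follows; the function name `satisfies_at_least_one` and its parameters are hypothetical and introduced only for illustration.

```python
from typing import Iterable


def satisfies_at_least_one(listed: Iterable[str], present: Iterable[str]) -> bool:
    """Return True if at least one listed element is among those present.

    Elements in `present` that are not in `listed` are allowed and ignored,
    and not every listed element needs to be present.
    """
    return not set(listed).isdisjoint(present)


if __name__ == "__main__":
    listed = ["A", "B"]
    # Enumerate the cases discussed above, plus the case with neither A nor B.
    for present in (["A"], ["B", "C"], ["A", "B"], ["C"]):
        print(present, satisfies_at_least_one(listed, present))
```

Under this sketch, A only, B only (even with an unlisted element C), and both A and B each satisfy the phrase, while the presence of only unlisted elements does not.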
In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” “composed of,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively.
The terms “approximately,” “substantially,” and “about” may be used to mean within ±20% of a target value in some embodiments, within ±10% of a target value in some embodiments, within ±5% of a target value in some embodiments, and within ±2% of a target value in some embodiments. The terms “approximately,” “substantially,” and “about” may include the target value.
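As a brief, non-limiting sketch, the tolerance bands above can be expressed as a relative-difference check. The function name `is_about`, its `tolerance` parameter (a fraction rather than a percentage), and the `BANDS` tuple are assumptions made here for illustration.

```python
def is_about(value: float, target: float, tolerance: float = 0.20) -> bool:
    """Return True if value lies within ±tolerance (a fraction) of target.

    The target value itself always satisfies the check.
    """
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    return abs(value - target) <= tolerance * abs(target)


# The four bands named in the text, expressed as fractions.
BANDS = (0.20, 0.10, 0.05, 0.02)

if __name__ == "__main__":
    # Show which bands a value of 104 satisfies for a target of 100.
    for band in BANDS:
        print(f"±{band:.0%}: {is_about(104.0, 100.0, band)}")
```

For a target of 100, a value of 104 falls within the ±20%, ±10%, and ±5% bands but outside the ±2% band.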
This application claims benefit under 35 U.S.C. § 119(e) of the filing date of U.S. provisional patent application Ser. No. 63/239,895, filed Sep. 1, 2021, entitled “MACHINE LEARNING TECHNIQUES FOR ESTIMATING MALIGNANT CELL GENE EXPRESSION IN COMPLEX TUMOR TISSUE,” Attorney Docket No. B1462.70026US01, and U.S. provisional patent application Ser. No. 63/181,365, filed Apr. 29, 2021, entitled “COMPUTATIONAL MACHINE LEARNING TOOL TO DECIPHER MALIGNANT CELL GENE EXPRESSION FROM COMPLEX TUMOR TISSUE”, Attorney Docket No. B1462.70026US00, the entire contents of each of which are incorporated by reference herein.
Number | Date | Country
---|---|---
63239895 | Sep 2021 | US
63181365 | Apr 2021 | US