The present invention relates to a computer-implemented method and a processing system for deconvolution of bulk RNA-sequencing data.
In recent years, Next Generation Sequencing (NGS) has made it possible to obtain information on the genetic activity of human cells, thus allowing characterization of the Tumor Micro Environment (TME) of patients. With NGS techniques, it is possible to obtain bulk sequencing data, that is, a measurement of the expression level of selected genes in patients. However, bulk sequencing only measures averaged expression levels, because the bulk mixture contains different cell types, which creates confounding factors during the characterization of the TME. For this reason, single-cell sequencing (in particular single-cell RNA sequencing), which analyzes gene expression at the resolution of single cells of a patient, is becoming more and more widespread. However, single-cell sequencing suffers from different problems, such as a high number of dropouts and higher costs.
Bulk sequencing remains today a more viable solution for obtaining information on gene expression, but the problem of deducing the original proportion of the different cells in the mixture persists. In order to overcome this limitation, deconvolution algorithms have been developed: given an N×M matrix, in which N is the number of observations, i.e. patients, and M is the number of genes, deconvolution algorithms output an N×P matrix, in which P is the number of different cell types and each entry is the percentage of the respective cell type in the respective one of the N patients. Current methods rely on using single-cell sequencing for different cell types to train predictive algorithms (both probabilistic and ML-based).
US 2021/0142867 A1 discloses a method for deconvolving bulk RNA sequencing data using single-cell RNA-seq of cell types that are relevant to the bulk tissues, which are further used to stratify patients. The method comprises selecting a subset of the matrix of counts-based sequencing data from single-cell RNA-seq data. Further, the method involves selecting informative genes, which includes using a gradient function whose computation involves discrete approximation. The gene selection method involves excluding non-informative genes that introduce noise. A loss function is defined between the bulk distribution and the mixed single-cell distribution to estimate cell type proportions in the bulk RNA-sequencing data.
In an embodiment, the present disclosure provides a computer-implemented method for deconvolution of bulk RNA sequencing data. The method comprises: obtaining input from sources, wherein the input comprises single-cell RNA sequencing (RNA-seq) data; generating, from the single-cell RNA sequencing data, diverse datasets based on a principle of same generating mixture probability such that each of the diverse datasets has a same cell type mixture proportion; using the generated diverse datasets as input datasets for training a prediction model using machine learning, wherein the training comprises: creating a causal prediction model in which virtual samples are generated from the generated diverse datasets; and performing contrastive learning on the causal prediction model, wherein a contrastive loss is used for the learning of invariant features with respect to a measurement mechanism by which the single-cell RNA sequencing datasets have been generated; and using the trained prediction model to predict the mixture of cell type quantities in the bulk RNA sequencing data.
Subject matter of the present disclosure will be described in even greater detail below based on the exemplary figures. All features described and/or illustrated herein can be used alone or combined in different combinations. The features and advantages of various embodiments will become apparent by reading the following detailed description with reference to the attached drawings, which illustrate the following:
In accordance with an embodiment, the present invention improves and further develops a method and a system of the aforementioned type for deconvolution of bulk RNA-sequencing data in such a way that the predictions become more stable and reliable.
In accordance with another embodiment, the present invention provides a computer-implemented method for deconvolution of bulk RNA sequencing data, the method comprising: obtaining input from sources comprising single-cell RNA sequencing, RNA-seq, data; generating, from the single-cell RNA sequencing data, diverse datasets based on the principle of same generating mixture probability such that each of the datasets has the same cell type mixture proportion; and using the generated datasets as input datasets for training a model using machine learning. The training comprises creating a causal prediction model in which virtual samples are generated from the generated diverse datasets; and performing contrastive learning on the causal prediction model, wherein the contrastive loss is used for the learning of invariant features with respect to the measurement mechanism by which the single-cell RNA sequencing datasets have been generated. The method further comprises using the trained prediction model to predict the mixture of cell type quantities contained in the bulk RNA sequencing data.
Furthermore, in accordance with an embodiment, the present invention provides a system for deconvolution of bulk RNA-sequencing data and a tangible, non-transitory computer-readable medium as specified in the independent claims.
Embodiments of the present invention address the problem of estimating the fraction of different cell types from bulk gene sequencing. This is a crucial issue since single-cell sequencing is a costly procedure, while bulk gene sequencing is more affordable. Embodiments of the invention propose a method that exploits the diversity of single-cell sequencing training datasets and improves performance over existing methods. Embodiments of the proposed method derive the cell type proportions from the (patient) tissue's aggregated gene expression and can be used to evaluate the patient's response to a potential treatment (patient stratification).
Embodiments of the present invention provide a method for bulk gene deconvolution, in particular with an application for patient stratification. The method may include creating a model for reconstructing the proportional composition of cell types out of averaged expression quantities. An initial step may include receiving single-cell datasets and generating diverse datasets, one from each source, based on the principle of the same mixing probability, or by mixing single-cell datasets. A next step may include training a model using machine learning, wherein the model includes a discrete gate module and a contrastive loss module. The discrete gate module may be used for removing noisy features and may be based on the information bottleneck (IB) principle, wherein the contrastive loss may comprise a causal model in which virtual samples are generated from a plurality of single-cell sequencing datasets. The contrastive loss promotes the learning of invariant features with respect to the measurement. During the last step, the trained model may be used to predict the mixture/portion/fraction of quantities from the bulk data.
According to an embodiment, the deconvolution system comprises a discrete gate module that is used to promote removal of noisy features from the input datasets. In this context, it may be provided that the information bottleneck (IB) principle is used as loss function to train the discrete gate module.
According to an embodiment, the deconvolution system comprises an auto encoding mechanism that is configured to transform input gene expressions into latent features.
According to an embodiment, it may be provided that the prediction is extended to the gene expression per cell type. This may be accomplished by execution of the steps of (i) generating in the training data also the bulk gene expression per cell type, and (ii) using a loss function that measures the reconstruction error on the single cell type gene expression and on the reconstruction of the gene expression and the mixture of the gene expression per cell type weighted by the cell type proportion.
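By way of non-limiting illustration, the loss described in the preceding paragraph may be sketched as follows. The sketch is an assumption and not the claimed implementation: the function name `deconvolution_loss`, the use of a mean squared error for every term, the additional term on the proportions themselves, and the equal weighting of the terms are all chosen here for illustration only.

```python
import numpy as np

def deconvolution_loss(bulk, props_true, props_pred, expr_true, expr_pred):
    """Illustrative combined loss for one virtual bulk sample (hypothetical helper).

    bulk:       (n_genes,)          bulk gene expression
    props_*:    (n_types,)          cell type proportions (true / predicted)
    expr_*:     (n_types, n_genes)  gene expression per cell type (true / predicted)
    """
    # Error on the predicted cell type proportions (assumed extra term).
    prop_err = np.mean((props_pred - props_true) ** 2)
    # Reconstruction error on the gene expression per cell type.
    type_err = np.mean((expr_pred - expr_true) ** 2)
    # Bulk reconstructed as the per-cell-type expression weighted by proportion.
    bulk_recon = props_pred @ expr_pred
    bulk_err = np.mean((bulk_recon - bulk) ** 2)
    # Equal weighting of the three terms is an assumption.
    return prop_err + type_err + bulk_err
```

In such a sketch, a perfect prediction of both the proportions and the per-cell-type expression yields a loss of zero whenever the bulk equals the proportion-weighted mixture.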
According to an embodiment, a simulated distribution of cell types may be used to generate a bulk gene expression from the single-cell RNA sequencing data. The gene expressions of the single cell RNA-sequencing data may be combined in proportion to the probability of a specific cell type to generate an aggregated bulk gene expression. Samples of the aggregated bulk gene expression may then be mapped using a gene graph to train a graph neural network, GNN, wherein the graph of the GNN is learned separately, based on the single cell gene expressions. The output of the GNN may be used to predict the mixture of cell type quantities contained in the bulk RNA-sequencing data.
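By way of non-limiting illustration, the generation of an aggregated bulk gene expression from single-cell data may be sketched as follows. The helper `simulate_bulk`, the Dirichlet prior on the mixing probabilities, the multinomial draw of cell counts and the toy Poisson data are assumptions made for this example. Passing the same `props` to two different single-cell datasets yields two virtual bulk datasets that share the same generating mixture probabilities.

```python
import numpy as np

def simulate_bulk(sc_expr, sc_labels, props, cells_per_sample=500, seed=0):
    """Generate virtual bulk samples from one single-cell dataset (hypothetical helper).

    sc_expr:   (n_cells, n_genes) single-cell expression matrix
    sc_labels: (n_cells,) integer cell type per cell, values in [0, n_types)
    props:     (n_samples, n_types) mixing probabilities, rows sum to 1
    Assumes every cell type has at least one cell in sc_expr.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_types = props.shape
    cells_by_type = [np.flatnonzero(sc_labels == t) for t in range(n_types)]
    bulk = np.zeros((n_samples, sc_expr.shape[1]))
    for i in range(n_samples):
        # Number of cells drawn per type follows the mixing probabilities.
        counts = rng.multinomial(cells_per_sample, props[i])
        for t, c in enumerate(counts):
            if c > 0:
                idx = rng.choice(cells_by_type[t], size=c, replace=True)
                bulk[i] += sc_expr[idx].sum(axis=0)
        bulk[i] /= cells_per_sample  # averaged expression, as in bulk sequencing
    return bulk

# Toy usage: two single-cell sources, one shared set of mixing probabilities.
rng = np.random.default_rng(1)
props = rng.dirichlet(np.ones(3), size=4)
labels = np.repeat(np.arange(3), 20)
sc_a = rng.poisson(2.0, size=(60, 5)).astype(float)
sc_b = rng.poisson(3.0, size=(60, 5)).astype(float)
bulk_a = simulate_bulk(sc_a, labels, props, seed=2)
bulk_b = simulate_bulk(sc_b, labels, props, seed=3)
```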
According to an embodiment, it may be provided that connections among genes are determined by means of a transformer network. The transformer network may then be used to predict the cell types and the gene expressions per cell type. This prediction may be performed by (i) computing matrices K,Q,V (related to key, query, value vectors, respectively) for each gene at each layer of the transformer network, and (ii) computing the cell types and the gene expressions per cell type based on a softmax attention mechanism at each layer of the transformer network.
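By way of non-limiting illustration, one softmax attention layer over genes may be sketched as follows. The function names, the single attention head, the projection matrices `Wq`, `Wk`, `Wv` and the scaling by the square root of the key dimension (the common scaled dot-product form) are assumptions made for this example. The prediction heads for the cell types and for the gene expression per cell type are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gene_attention(H, Wq, Wk, Wv):
    """One attention layer over genes (hypothetical helper).

    H:          (n_genes, d)  one embedding per gene
    Wq, Wk, Wv: (d, d_k)      projection matrices of this layer
    Returns the updated gene features and the attention matrix.
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A = softmax(scores, axis=-1)  # (n_genes, n_genes) gene-to-gene connections
    return A @ V, A
```

In this sketch, row i of the attention matrix `A` can be read as the learned connection strengths from gene i to all other genes.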
According to an embodiment, the predicted mixture of cell type quantities may be used for patient stratification. In this context, it may be provided that first a cell type x gene expression matrix is generated based on the predicted mixture of cell type quantities. This matrix may then be combined with domain knowledge information and/or with additional patient information and may be embedded by means of a multimodal embedding model. This embedding may be used for making a patient specific risk prediction with respect to diseases of interest.
According to an embodiment, the gene expression per cell type may be used to automatically calibrate the measurements by which the bulk RNA-sequencing data are obtained. This automatic calibration function may be realized by conducting a separate measurement with an external measuring device, preferably a microscope. In this separate measurement, cell type counting may be performed and the obtained results may be compared with the cell type counts predicted by the trained prediction model. Based on the obtained cell type count differences, the measurements may be automatically calibrated.
According to an alternative embodiment, the automatic calibration function may be implemented by splitting the sample from which the bulk RNA-sequencing data are obtained in two samples and by conducting separate gene expression measurements on each of the two samples. Next, a first prediction model may be generated and trained for the measurement on a first one of the two samples and a second prediction model may be generated and trained for the measurement on the second one of the two samples. Finally, the predictions may be automatically corrected such that the two separate gene expression measurements yield the same results.
There are several ways to design and further develop the teaching of the present invention in an advantageous way. To this end, reference is made to the dependent claims on the one hand and to the following explanation of preferred embodiments of the invention by way of example, illustrated by the figure, on the other hand. In connection with the explanation of the preferred embodiments of the invention with the aid of the figure, generally preferred embodiments and further developments of the teaching will be explained. In the drawing
Embodiments of the present invention provide computer-implemented methods and processing systems for bulk gene deconvolution that implement Machine Learning (ML) techniques to predict cell type percentages (or proportions) when testing on a bulk gene RNA-sequencing sample. In this embodiment, the present invention estimates a model that is invariant to the underlying measurement.
According to an embodiment, the present invention provides a method for creating a model for reconstructing proportional composition of cell types out of averaged expression quantities. The method may comprise the following steps:
In a brief overview, according to embodiments of the invention, the systems and methods for bulk gene deconvolution disclosed herein may implement one or more of the following aspects, each of which will be described in more detail further below:
According to embodiments of the invention, the systems and methods for bulk gene deconvolution disclosed herein may receive one or more of the following data as input:
In an embodiment, the system may receive further information, in particular Electronic Health Record information related to a status of the respective patient (e.g. vital parameters such as heart rate). This additional information may be associated with the measured RNA-seq data and may be used in an embodiment that uses a particular disease as input and conditions the model on this information.
According to embodiments of the invention, the systems and methods for bulk gene deconvolution disclosed herein may be configured to output the proportions of the different cell types contained in a bulk sequencing probe. In addition, the output may include information on active diseases and on the status of cells.
According to an embodiment of the present invention, the bulk gene deconvolution system may be configured to use a discrete gating function 410 as well as an auto encoding mechanism 420, as exemplarily illustrated in
According to an embodiment of the present invention, the bulk gene deconvolution system also considers the case, where the features are discrete (i.e. the latent features t′ are discrete values). In this regard,
For learning an invariant model, i.e. a model that is invariant to the multiple single-cell sequencing environments, embodiments of the invention provide a sample generation process that starts with sample mixing probabilities and then generates the bulk samples according to these probabilities. The invariant model may then be learned using the contrastive loss. The contrastive loss may require that the feature probabilities $p_{ij}$, computed from the mapping of the features $(\phi_i, \phi'_i)$ using a softmax function (or normalized exponential function) $p_{ij} \propto \exp(\phi_i^{T}\phi_j/\tau)$, are similar on two sets of samples taken from two different single-cell datasets:
The contrastive learning is described in more detail in connection with
As regards the deployed loss functions, according to an embodiment, it may be provided that a reconstruction loss (L1, L2, root mean square) is used for the mixing probabilities and when predicting the gene expression, while the KL divergence may be considered as loss function in the contrastive learning described above. The information bottleneck loss may be used to reduce the information at the feature level.
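By way of non-limiting illustration, the contrastive loss with a KL divergence may be sketched as follows. The function names, the default temperature, the inclusion of the self-similarity term (i = j), the direction of the KL divergence and the averaging over samples are assumptions made for this example. Row i of `phi_a` and row i of `phi_b` are assumed to be features of two virtual bulk samples generated with the same mixing probabilities from two different single-cell datasets.

```python
import numpy as np

def similarity_probs(phi, tau):
    """Row-wise softmax of phi_i^T phi_j / tau (hypothetical helper)."""
    logits = phi @ phi.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def contrastive_kl(phi_a, phi_b, tau=0.5, eps=1e-12):
    """Mean KL divergence between the similarity distributions of two feature sets.

    phi_a, phi_b: (n_samples, d) features from two single-cell environments.
    """
    p = similarity_probs(phi_a, tau)
    q = similarity_probs(phi_b, tau)
    kl_rows = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
    return float(np.mean(kl_rows))
```

In this sketch the loss vanishes when both environments induce the same similarity structure among the samples, which is the sense in which the learned features would be invariant to the measurement.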
Next, the case is considered, where the network encoding and predicting is a graph convolution network. In this case, the gene expression may be converted into a graph, where the connections are defined using the k-nearest neighbor based on a similarity between two gene activation patterns.
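By way of non-limiting illustration, the construction of a k-nearest-neighbor gene graph may be sketched as follows. The function name `knn_gene_graph`, the choice of cosine similarity as the similarity between two gene activation patterns, and the symmetrization of the adjacency matrix are assumptions made for this example. The graph neural network operating on the resulting graph is not shown.

```python
import numpy as np

def knn_gene_graph(expr, k):
    """Build a symmetric k-NN adjacency matrix over genes (hypothetical helper).

    expr: (n_cells, n_genes) single-cell expression matrix
    k:    number of neighbors per gene, must satisfy k < n_genes
    """
    X = expr.T  # one activation pattern (over cells) per gene
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    sim = Xn @ Xn.T                      # cosine similarity between genes
    np.fill_diagonal(sim, -np.inf)       # exclude self-loops
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    A = np.zeros(sim.shape, dtype=int)
    rows = np.repeat(np.arange(X.shape[0]), k)
    A[rows, nbrs.ravel()] = 1
    return np.maximum(A, A.T)            # make the graph undirected
```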
In this context,
Embodiments of the present invention address the problem of improving the RNA-sequencing measurements performed in a laboratory based on the prediction of the cell type and gene expression per cell type. In this context, two different scenarios are considered where the output of the proposed bulk gene deconvolution system is used to calibrate a sequencing system. When there is another way to evaluate the cell type composition, for example using a microscope, the measurement system can be modified so that the two quantities become close. This scenario including an automatic calibration using separated measurements is schematically illustrated in
Alternatively, as exemplarily shown in
Embodiments of the present invention provide methods and systems that use bulk to mixture predictions as disclosed herein for the purpose of patient stratification. A respective patient stratification workflow is exemplarily illustrated in
When using a decoder network, the training may be performed in a modified way. In this context,
According to embodiments, the present invention provides methods and systems for conditional prediction of cell type on diseases. These embodiments consider the case where information on the disease is available when training the regression model. In this case, the virtual bulk can be used and the disease mixture can be added. When estimating, also the presence of diseases can be predicted.
Alternatively, when training for single diseases separately, the disease may be used as input (e.g. using an indicator vector or a one-hot encoding) and the model is conditioned on this information. In this way, one can have different predictions depending on the hypothesis on the disease.
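By way of non-limiting illustration, conditioning the model input on a disease hypothesis may be sketched as follows. The function name `condition_on_disease` and the choice of concatenating the one-hot indicator vector to the bulk gene expression are assumptions made for this example. Other conditioning mechanisms are equally conceivable.

```python
import numpy as np

def condition_on_disease(bulk, disease_idx, n_diseases):
    """Append a one-hot disease indicator to a bulk sample (hypothetical helper).

    bulk:        (n_genes,) bulk gene expression
    disease_idx: integer index of the hypothesized disease
    n_diseases:  number of diseases considered
    """
    one_hot = np.zeros(n_diseases)
    one_hot[disease_idx] = 1.0
    return np.concatenate([bulk, one_hot])
```

Evaluating a trained model on the same bulk sample with different values of `disease_idx` would then give one prediction per disease hypothesis.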
The information of the predicted disease can be used to reject the prediction or to select a dedicated model specific for the specific disease. In addition to the disease, other information (for example from the Electronic Health Record information) can be used similarly.
Embodiments of the present invention provide the following advantages: By modelling the bulk as different environments, the model can learn invariant and more stable predictions. The method thus provides more reliable information. Additional confidence information may be provided when training with variational methods. With the contrastive loss, the trained features can be used for further processing, if the granularity is at the level of cell type. The prediction method according to embodiments of the present invention as disclosed herein outperforms the competing Scaden method in almost all scenarios (for reference, see Menden, K., Marouf, M., Oller, S., Dalmia, A., Magruder, D. S., Kloiber, K., Heutink, P. and Bonn, S., 2020. Deep learning-based cell composition analysis from tissue expression profiles. Science Advances, 6(30), p.eaba2619).
Many modifications and other embodiments of the invention set forth herein will come to mind to the one skilled in the art to which the invention pertains having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
While subject matter of the present disclosure has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. Any statement made herein characterizing the invention is also to be considered illustrative or exemplary and not restrictive as the invention is defined by the claims. It will be understood that changes and modifications may be made, by those of ordinary skill in the art, within the scope of the following claims, which may include any combination of features from different embodiments described above.
The terms used in the claims should be construed to have the broadest reasonable interpretation consistent with the foregoing description. For example, the use of the article “a” or “the” in introducing an element should not be interpreted as being exclusive of a plurality of elements. Likewise, the recitation of “or” should be interpreted as being inclusive, such that the recitation of “A or B” is not exclusive of “A and B,” unless it is clear from the context or the foregoing description that only one of A and B is intended. Further, the recitation of “at least one of A, B and C” should be interpreted as one or more of a group of elements consisting of A, B and C, and should not be interpreted as requiring at least one of each of the listed elements A, B and C, regardless of whether A, B and C are related as categories or otherwise. Moreover, the recitation of “A, B and/or C” or “at least one of A, B or C” should be interpreted as including any singular entity from the listed elements, e.g., A, any subset from the listed elements, e.g., A and B, or the entire list of elements A, B and C.
Number | Date | Country | Kind |
---|---|---|---|
21193630.7 | Aug 2021 | EP | regional |
This application is a U.S. National Phase application under 35 U.S.C. § 371 of International Application No. PCT/EP2022/056221, filed on Mar. 10, 2022, and claims benefit to European Patent Application No. EP 21193630.7, filed on Aug. 27, 2021. The International Application was published in English on Mar. 2, 2023 as WO 2023/025419 A1 under PCT Article 21(2).
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2022/056221 | 3/10/2022 | WO |