The present invention generally relates to identifying cell types or cell states based on gene-gene interactions.
RNA sequencing (RNA-seq) is a genomic technique for the detection and analysis of messenger RNA in a sample. In the past, RNA-seq was necessarily performed on samples containing many cells. Single-cell RNA-seq (scRNA-seq), enabled by advancements in laboratory technology and technique, refers to the detection and analysis of messenger RNA from a single cell. Although each measurement is derived from only a single cell, a biological study typically involves a large number of cells (thousands to millions), so scRNA-seq data comes as a very large table (rows denoting cells and columns denoting genes).
Convolutional neural networks (CNNs) are a type of machine learning model characterized by at least one convolutional layer. CNNs have wide applications including image processing.
Systems and methods for cell typing in accordance with embodiments of the invention are illustrated. One embodiment includes a cell typing system, including a processor, and a memory, the memory containing a cell typing application that configures the processor to obtain single cell ribonucleic acid sequencing (scRNA-seq) data generated from a single cell, generate a two-dimensional (2D) image including a grid of pixels, where each pixel describes a gene-gene interaction based upon the scRNA-seq data, provide the 2D image to a convolutional neural network (CNN), obtain a cell classification of the single cell from the CNN, and provide the cell classification via a display.
In a further embodiment, to generate the 2D image, the cell typing application further directs the processor to generate a pairwise interaction strength matrix that maximizes entropy of the scRNA-seq data, generate a distance matrix for the 2D image, where the interaction strength matrix and the distance matrix have the same dimensions, optimize a transport function to produce a transport matrix coupling the distance matrix and the interaction strength matrix, and transpose the scRNA-seq data into the 2D image using the transport matrix.
In still another embodiment, to generate the pairwise interaction strength matrix, the cell typing application further directs the processor to maximize system entropy using a multivariate Gaussian distribution.
In a still further embodiment, the transport function is optimized using a Gromov-Wasserstein discrepancy.
In yet another embodiment, a loss function of the Gromov-Wasserstein discrepancy uses Kullback-Leibler divergence.
In a yet further embodiment, the Gromov-Wasserstein discrepancy is calculated using a regularized approximation.
In another additional embodiment, the CNN is trained using a training data set including 2D images, each including a grid of pixels, where each pixel describes a gene-gene interaction based upon a different scRNA-seq data, and where each 2D image is labeled with a cell type from which the different scRNA-seq data was obtained.
In a further additional embodiment, the system further includes a sequencer configured to generate the scRNA-seq data from a cell sample.
In another embodiment again, the cell typing application further directs the processor to provide the 2D image via the display.
In a further embodiment again, the CNN further provides a confidence metric reflecting probability of the cell classification being correct.
One embodiment includes a method for cell typing, including obtaining single cell ribonucleic acid sequencing (scRNA-seq) data generated from a single cell, generating a two-dimensional (2D) image including a grid of pixels, where each pixel describes a gene-gene interaction based upon the scRNA-seq data, providing the 2D image to a convolutional neural network (CNN), obtaining a cell classification of the single cell from the CNN, and providing the cell classification via a display.
In another embodiment, generating the 2D image includes generating a pairwise interaction strength matrix that maximizes entropy of the scRNA-seq data, generating a distance matrix for the 2D image, where the interaction strength matrix and the distance matrix have the same dimensions, optimizing a transport function to produce a transport matrix coupling the distance matrix and the interaction strength matrix, and transposing the scRNA-seq data into the 2D image using the transport matrix.
In still yet another embodiment, generating the pairwise interaction strength matrix includes maximizing system entropy using a multivariate Gaussian distribution.
In a still yet further embodiment, the transport function is optimized using a Gromov-Wasserstein discrepancy.
In still another additional embodiment, a loss function of the Gromov-Wasserstein discrepancy uses Kullback-Leibler divergence.
In a still further additional embodiment, the Gromov-Wasserstein discrepancy is calculated using a regularized approximation.
In still another embodiment again, the CNN is trained using a training data set including 2D images, each including a grid of pixels, where each pixel describes a gene-gene interaction based upon a different scRNA-seq data, and where each 2D image is labeled with a cell type from which the different scRNA-seq data was obtained.
In a still further embodiment again, the method further includes steps for generating the scRNA-seq data from a cell sample using a sequencer.
In yet another additional embodiment, the method further includes steps for providing the 2D image via the display.
In a yet further additional embodiment, the CNN further provides a confidence metric reflecting probability of the cell classification being correct.
Additional embodiments and features are set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the specification or may be learned by the practice of the invention. A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings, which form a part of this disclosure.
The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention.
Cell typing is a critical tool in both medicine and biological research. For example, cell typing can be used to investigate specific types of cancers and assist with directing disease treatments. However, conventional analytical techniques for identifying individual cells from scRNA-seq data, such as (but not limited to) discriminant analysis, Bayesian classification, decision trees, and neural networks, are all deficient in extracting the most discriminative features. In part, this is because scRNA-seq data is typically stored as a vector or matrix, which is convenient from a data storage perspective but is not the most informative way to present the data. When stored in this manner, the information about gene-gene interactions is buried in the unordered expression matrix.
Systems and methods described herein describe and utilize GenoMap, a new structure for representing scRNA-seq data in a two-dimensional (2D) format that relates a gene placement configuration in 2D to a gene-gene interaction matrix computed from the high-dimensional data. In many embodiments, the gene-gene interaction matrix is computed by maximizing the entropy of the genomic data. Because the number of possible ways of placing the genes into a 2D grid grows as the factorial of the number of involved genes (typically around 20,000 for a human cell), a robust optimization can have a significant impact on subsequent analysis. GenoMaps possess the basic characteristic of an image, where the pixel configuration is determined by the gene-gene interactions of the cell. After GenoMap construction, a convolutional neural network can be used to extract genomic interaction features in order to type the cell. Turning now to the drawings, systems and methods for cell typing using GenoMaps are illustrated. Systems for cell typing are discussed first below.
Cell typing systems are computational systems which obtain scRNA-seq data and convert said data into a GenoMap. In many embodiments, cell typing systems further utilize CNNs to process GenoMaps in order to type cells. Turning now to
Turning now to
While a specific system architecture and typing device architecture are illustrated in
Cell typing processes as described herein use GenoMaps as an intermediary between scRNA-seq data and a classification. In numerous embodiments, the type of cell may be readily apparent from the GenoMap alone to a human user. However, in various embodiments, machine learning is used to classify a cell using a GenoMap with significantly higher precision than is possible by the human eye alone.
Turning now to
In order to reconfigure scRNA-seq data into a GenoMap, first, a pairwise interaction strength matrix that maximizes the entropy of the scRNA-seq data is generated. Then the genes in the scRNA-seq data are placed in a 2D grid such that the pairwise interaction is preserved maximally. An optimal transport optimization (i.e. minimization of Gromov-Wasserstein discrepancy between the interaction-space of genes and the Euclidean space of the 2D grid) can be used to solve the problem efficiently.
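By way of illustration only, the distance matrix of the 2D grid referred to above can be sketched as the matrix of Euclidean distances between pixel positions. The function name `grid_distance_matrix` and the row-major ordering of grid positions are assumptions made for this sketch and are not taken from the disclosure.

```python
import numpy as np

def grid_distance_matrix(p, q):
    """Pairwise Euclidean distances between the p*q positions of a p-by-q grid.

    Positions are enumerated in row-major order (an assumption of this sketch),
    so position index r * q + c corresponds to row r, column c.
    """
    rows, cols = np.meshgrid(np.arange(p), np.arange(q), indexing="ij")
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    diff = coords[:, None, :] - coords[None, :, :]   # shape (p*q, p*q, 2)
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy 3-by-4 grid: 12 positions, hence a 12-by-12 distance matrix.
D = grid_distance_matrix(3, 4)
```

The resulting matrix is symmetric with a zero diagonal, and its size matches the number of grid positions, consistent with the requirement that the interaction strength matrix and the distance matrix have the same dimensions when the number of genes equals the number of positions.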
Assuming a data set $I_s \in \mathbb{R}^{m \times n}$ from an experiment on $m$ cells (each cell having $n$ genes), the objective is to restructure the $n$ genes of each cell into a 2D grid of size $p \times q$, with $p \times q \geq n$, so as to maximize the entropy of the data. Entropy is frequently used in information theory to measure the information content of a system. Mathematically, entropy measures the uncertainty associated with a random variable or a random system. The entropy $H(X)$ of a discrete random variable $X$ can be written as:
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$
where $p(x) = P(X = x)$, $x \in \mathcal{X}$, denotes the probability mass function (pmf) of the random variable $X$, and $\mathcal{X}$ is a finite set (such as $\{1, 2, \ldots\}$).
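As a minimal sketch of the entropy of a discrete random variable described above, the following computes $-\sum_x p(x) \log p(x)$ for a pmf given as a vector. The function name `shannon_entropy` and the choice of logarithm base as a parameter are illustrative assumptions.

```python
import numpy as np

def shannon_entropy(pmf, base=np.e):
    """Entropy H(X) = -sum_x p(x) log p(x) of a discrete pmf."""
    p = np.asarray(pmf, dtype=float)
    if p.ndim != 1 or np.any(p < 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("pmf must be a 1-D non-negative vector summing to 1")
    nz = p[p > 0]  # terms with p(x) = 0 contribute 0 by convention
    return float(-(nz * np.log(nz)).sum() / np.log(base))

# A uniform pmf over four outcomes carries log2(4) = 2 bits of entropy,
# while a sharply peaked pmf carries much less.
uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25], base=2)
peaked = shannon_entropy([0.97, 0.01, 0.01, 0.01], base=2)
```

The comparison between `uniform` and `peaked` illustrates why entropy serves as a measure of uncertainty: the more evenly the probability is spread, the larger the entropy.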
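The scRNA-seq data for a cell of $n$ genes can be written as a state vector $x = (x_1, \ldots, x_n)$, where $x_i$ is the expression level of the $i$-th gene. For $m$ cells, there are $m$ state vectors $x^1, \ldots, x^m$. The goal is to restructure the genes in each cell in a way that maximizes the entropy of the gene expression vectors:
$$H = -\sum_{x} P(x) \log P(x)$$
subject to the constraints
$$\sum_{x} P(x) = 1, \qquad \langle x \rangle = \sum_{x} P(x)\, x = \frac{1}{m} \sum_{k=1}^{m} x^k, \qquad \langle x x^{\top} \rangle = \sum_{x} P(x)\, x x^{\top} = \frac{1}{m} \sum_{k=1}^{m} x^k (x^k)^{\top}$$
The probability mass function for the gene expressions that maximizes the system entropy is given by a multivariate Gaussian distribution parametrized by the mean $\langle x \rangle$ and the covariance matrix $\Omega$ as follows:
$$P(x; \langle x \rangle, \Omega) = (2\pi)^{-n/2} \det(\Omega)^{-1/2} \exp\left(-\frac{1}{2} (x - \langle x \rangle)^{\top} \Omega^{-1} (x - \langle x \rangle)\right)$$
Here, the covariance matrix is defined as:
$$\Omega_{ij} = \frac{1}{m} \sum_{k=1}^{m} \left(x_i^k - \langle x_i \rangle\right)\left(x_j^k - \langle x_j \rangle\right)$$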
The pairwise interaction strength between $x_i$ and $x_j$ can then be computed from the covariance matrix as follows:
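The following is a hedged sketch of estimating the mean and covariance matrix of the maximum-entropy Gaussian from a cells-by-genes table. How the covariance is converted into an interaction strength is stated only in general terms in this text, so the normalization used in `interaction_strength` below (absolute Pearson correlation) is an assumption chosen purely for illustration, as are the function names and the synthetic data.

```python
import numpy as np

def gaussian_maxent_parameters(expr):
    """Mean vector and covariance matrix (1/m normalization) of an m-by-n table."""
    X = np.asarray(expr, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    cov = centered.T @ centered / X.shape[0]
    return mean, cov

def interaction_strength(cov, eps=1e-12):
    """ASSUMPTION: covariance rescaled to |Pearson correlation|.

    This is one plausible way to derive a pairwise strength from the
    covariance matrix; it is not confirmed as the formula of the disclosure.
    """
    sd = np.sqrt(np.clip(np.diag(cov), eps, None))
    return np.abs(cov / np.outer(sd, sd))

# Synthetic data (hypothetical): 200 cells, 6 genes, with gene 1 tracking gene 0.
rng = np.random.default_rng(0)
expr = rng.poisson(lam=5.0, size=(200, 6)).astype(float)
expr[:, 1] = expr[:, 0] + rng.normal(scale=0.1, size=200)

mean, cov = gaussian_maxent_parameters(expr)
S = interaction_strength(cov)
```

Under this assumed normalization, the strongly coupled gene pair (0, 1) receives a strength close to 1, while independently generated genes receive strengths near 0.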
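The problem of constructing the GenoMap (i.e., optimally placing $n$ genes at $n$ positions of the 2D grid of size $p \times q$, $n \leq p \times q$) can be written as a Gromov-Wasserstein discrepancy between the scaled pairwise interaction strength matrix $C$ of the $n$ genes and the distance matrix $\bar{C}$ of the 2D grid:
$$GW(C, \bar{C}, u, v) = \min_{T \in \mathcal{C}_{u,v}} \mathcal{E}_{C,\bar{C}}(T), \qquad \mathcal{E}_{C,\bar{C}}(T) = \sum_{i,j,k,l} L(C_{ik}, \bar{C}_{jl})\, T_{ij}\, T_{kl}$$
Here, the transport matrix $T$ is a coupling between the two spaces on which $C$ and $\bar{C}$ are defined, $\mathcal{C}_{u,v} = \{T \in \mathbb{R}_{+}^{n \times n} : T \mathbf{1} = u,\ T^{\top} \mathbf{1} = v\}$ is the set of couplings with marginals $u$ and $v$, and $L$ is a loss function such as the Kullback-Leibler divergence $L(a, b) = a \log(a/b) - a + b$. Introducing the 4-way tensor:
$$\mathcal{L}(C, \bar{C}) = \left(L(C_{ik}, \bar{C}_{jl})\right)_{i,j,k,l}, \qquad \mathcal{E}_{C,\bar{C}}(T) = \langle \mathcal{L}(C, \bar{C}) \otimes T,\ T \rangle$$
Here ⊗ denotes the tensor-matrix multiplication as follows:
$$\mathcal{L} \otimes T = \left(\sum_{k,l} \mathcal{L}_{i,j,k,l}\, T_{kl}\right)_{i,j}$$
In many embodiments, a regularized Gromov-Wasserstein discrepancy is used for computational efficiency. The regularized approximation of the original Gromov-Wasserstein formulation is:
$$GW_{\varepsilon}(C, \bar{C}, u, v) = \min_{T \in \mathcal{C}_{u,v}} \mathcal{E}_{C,\bar{C}}(T) - \varepsilon H(T)$$
where $\varepsilon$ is a regularization parameter and the entropy of $T \in \mathbb{R}_{+}^{n \times n}$ is defined as
$$H(T) = -\sum_{i,j} T_{ij} \left(\log T_{ij} - 1\right)$$
A projected gradient descent is used to solve this nonconvex optimization problem, where both the gradient step and the projection are computed according to the KL metric. Iterations of this algorithm are given by
$$T \leftarrow \mathrm{Proj}^{KL}_{\mathcal{C}_{u,v}}\left(T \odot e^{-\tau\left(\nabla \mathcal{E}_{C,\bar{C}}(T) - \varepsilon \nabla H(T)\right)}\right)$$
where $\tau > 0$ is a small step size, and the KL projector of any matrix $K$ is:
$$\mathrm{Proj}^{KL}_{\mathcal{C}_{u,v}}(K) = \arg\min_{T' \in \mathcal{C}_{u,v}} KL(T' \mid K)$$
In the special case $\tau = 1/\varepsilon$, the iteration reduces to
$$T \leftarrow \mathcal{T}\left(\mathcal{L}(C, \bar{C}) \otimes T,\ u,\ v\right)$$
where $\mathcal{T}(M, u, v) = \arg\min_{T \in \mathcal{C}_{u,v}} \langle M, T \rangle - \varepsilon H(T)$ is the solution of the entropically regularized optimal transport problem with cost matrix $M$.
In order to additionally speed up computation, if the loss can be written as
$$L(a, b) = f_1(a) + f_2(b) - h_1(a)\, h_2(b)$$
for functions $(f_1, f_2, h_1, h_2)$, then, for any $T \in \mathcal{C}_{u,v}$,
$$\mathcal{L}(C, \bar{C}) \otimes T = c_{C,\bar{C}} - h_1(C)\, T\, h_2(\bar{C})^{\top}$$
where $c_{C,\bar{C}} = f_1(C)\, u\, \mathbf{1}^{\top} + \mathbf{1}\, v^{\top} f_2(\bar{C})^{\top}$ is independent of $T$.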
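The regularized Gromov-Wasserstein procedure described above can be sketched in NumPy as follows, under several stated assumptions. The special case of step size equal to the inverse of the regularization parameter is used, so each outer iteration is one entropic optimal transport solve (Sinkhorn scaling) whose cost is the tensor-matrix product, up to a constant factor that can be absorbed into the regularization parameter. The KL loss is decomposed as $f_1(a) = a \log a - a$, $f_2(b) = b$, $h_1(a) = a$, $h_2(b) = \log b$. The small positive `offset` (needed because the KL loss requires positive entries), the uniform marginals, the toy matrices, and the barycentric `project_to_grid` step are all assumptions of this sketch and are not taken from the disclosure.

```python
import numpy as np

def sinkhorn(cost, u, v, eps, n_iter=500):
    """Entropic OT: scale exp(-cost/eps) toward the couplings of (u, v)."""
    K = np.exp(-(cost - cost.min()) / eps)  # constant shift leaves the solution unchanged
    b = np.ones_like(v)
    for _ in range(n_iter):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]

def entropic_gw_kl(C, C_bar, u, v, eps=0.5, n_outer=50, offset=1e-2):
    """Regularized Gromov-Wasserstein coupling with KL loss, step size 1/eps."""
    C = np.asarray(C, dtype=float) + offset        # ASSUMPTION: keep entries positive
    C_bar = np.asarray(C_bar, dtype=float) + offset
    f1_C, f2_Cb = C * np.log(C) - C, C_bar         # f1(a) = a log a - a, f2(b) = b
    h1_C, h2_Cb = C, np.log(C_bar)                 # h1(a) = a, h2(b) = log b
    # Constant term of the tensor-matrix product (independent of T).
    const = (f1_C @ u)[:, None] + (f2_Cb @ v)[None, :]
    T = np.outer(u, v)                             # start from the product coupling
    for _ in range(n_outer):
        cost = const - h1_C @ T @ h2_Cb.T          # fast tensor-matrix product
        T = sinkhorn(cost, u, v, eps)
    return T

def project_to_grid(x, T, v, shape):
    """ASSUMPTION: pixel j receives the T-weighted average sum_i T_ij x_i / v_j."""
    return ((np.asarray(x, dtype=float) @ T) / v).reshape(shape)

# Toy problem (hypothetical data): 6 genes placed on a 2-by-3 grid.
rng = np.random.default_rng(0)
p, q = 2, 3
n = p * q
rows, cols = np.divmod(np.arange(n), q)
coords = np.stack([rows, cols], axis=1).astype(float)
C_bar = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
C_bar /= C_bar.max()
perm = rng.permutation(n)
C = C_bar[np.ix_(perm, perm)]   # stand-in for the scaled matrix C: a shuffled grid
u = np.full(n, 1.0 / n)
v = np.full(n, 1.0 / n)

T = entropic_gw_kl(C, C_bar, u, v)
image = project_to_grid(rng.random(n), T, v, (p, q))
```

Because the last Sinkhorn scaling is applied to the columns, the column sums of `T` match `v` up to floating-point error, while the row sums match `u` only approximately, to the extent the scaling has converged. Smaller values of `eps` give a sharper, more permutation-like coupling at the cost of slower and less stable convergence.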
Turning now to
GenoMaps can also be used for a wide variety of applications including (but not limited to) cell clustering, gene signature extraction, single cell data integration, and cellular trajectory analysis. Furthermore, while GenoMaps are discussed with respect to scRNA-seq data, a similar mathematical approach can be implemented with respect to any high-dimensional tabular data set to graphically represent it in a generalized form of the GenoMap, referred to as a TabMap.
In this way, complex datasets can be depicted in imaging format, with the relationships among the data components encoded in terms of the pixelated configuration. These configurations can then be processed using CNNs, which are highly effective tools for image processing. TabMaps in their general form can be used for any dataset specific task such as (but not limited to) dimensionality reduction, visualization, and/or any other function that a machine learning model can perform. It is therefore to be understood that the present invention may be practiced in ways other than specifically described, without departing from the scope and spirit of the present invention. Thus, embodiments of the present invention should be considered in all respects as illustrative and not restrictive. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.
The current application claims the benefit of and priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 63/479,724 entitled “Cartography of Genomic Interactions Enables Deep Analysis of Single-Cell Expression Data” filed Jan. 12, 2023. The disclosure of U.S. Provisional Patent Application No. 63/479,724 is hereby incorporated by reference in its entirety for all purposes.
Number | Date | Country
---|---|---
63479724 | Jan 2023 | US