Gene-expression profiling of whole tissues is affected by the different cell types that exist in the tissue and their relative proportions. Thus, changes detected by differential expression analysis may reflect differences in the proportions of the cell-types between samples rather than an important mechanistic change in gene expression. For example, the proportion of tumor cells in breast cancer biopsies was found to significantly affect expression profiles, where consideration of these proportions improved response prediction. Therefore, profiling of heterogeneous tissues rather than sorted cell-types can greatly limit the conclusions derived from such analyses.
Solutions for experimentally separating cell-types from heterogeneous tissues include laser-capture microdissection to isolate morphologically distinguishable cells and flow cell sorting to purify cell-types from a tissue. However, in addition to the time-consuming nature of these methods, they may result in insufficient quantities of RNA, where amplification steps may introduce artifacts to the gene expression data. Single cell RNA sequencing is becoming feasible; however, experimental costs are high and few studies utilize this method on a large patient pool. To address this issue, several approaches to computationally separate expression profiles of heterogeneous tissues into their constituent cell-types along with their relative proportions per sample have been developed. Most approaches utilize a linear model that has been demonstrated to yield accurate expression estimates; in such a model the gene-expressions of each cell-type are added up to form a mixed expression, where each cell-type is weighted according to its relative proportion in the tissue.
All currently existing separation methods require some a priori information about the tissue analyzed, such as the number of cell-types and their relative proportions in the tissue, or the number of cell-types, their identity and their purified gene expression, or just the number of cell-types in the tissue. A preliminary attempt to estimate the number of cell-types in the mixed data, but not their identities, has also been proposed. However, most studies do not purify the different cell populations in the tissue, enumerate their proportions or verify their identity, rendering these methods inapplicable to separation of such heterogeneous gene-expression datasets.
We have developed a novel approach to blindly separate heterogeneous gene-expression data, i.e., without using any specific prior information regarding the analyzed dataset. In addition to separating the heterogeneous tissue into the individual gene expression profiles of its constituent cell-types and their relative proportions per sample, the algorithm described here performs an extra step of identifying the number of cell-types in the tissue and their identities. Compared to existing methods, the only a priori information the algorithm requires is an initial guess of the cell-types that may exist in the analyzed tissue and purified reference signatures of these cell-types, which may be found in abundance in publicly available databases. We have successfully tested our algorithm on three publicly available datasets in which all the conditions are controlled and on a publicly available semi-controlled dataset with estimated cell-type proportions.
This method is the first that can practically be applied, in a “plug and play” fashion, to any existing dataset of heterogeneous tissue samples, in order to identify the cell-types in the samples, their identities, their proportions per sample and their separated gene-expression signatures without requiring any prior knowledge.
The algorithm disclosed herein is designed to identify the number of cell-types in heterogeneous tissue samples, their identities and their relative proportions per sample, and to separate their individual gene expression signatures. The proposed algorithm includes three parts. In the first part, non-negative matrix factorization is used to obtain an initial estimate of expression profiles for each cell-type. A rough initial estimate of the numbers and identities of the cell-types in the tissue is required. This estimate can include cell-types that may not exist in the tissue. However, if a true cell-type is not included in the initial estimate, then the algorithm will not detect this cell-type and there may be ambiguities in the resulting cell-type signatures and proportions. In addition, purified reference signatures are required for each of the cell-types included in the initial estimate. Such reference signatures may be found in abundance in the Gene Expression Omnibus (GEO; Barrett T, et al. (2009) Nucleic Acids Res 37: D885-890, the contents of which are incorporated herein by reference) and may be general, i.e., not disease, tissue, experiment or study-specific. In the second part of the algorithm, the true number of cell-types is estimated using the symmetric Kullback-Leibler divergence (SKLD) between each of the estimated cell-type profiles and the initial cell-type reference signatures, where the closest estimated profiles are then chosen as the final cell-types. SKLD, a measure used to calculate the difference between two probability distributions, is used here as a measure of distance, as we describe below under equation (5). In the final part, the cell-type proportions are computed per sample, using the method of non-negative least squares (NNLS), a method that solves matrix equations algebraically with an added constraint for non-negative elements, as we describe in the methods section under equation (6).
Additional adjustments, motivated by the application of the algorithm to gene-expression data include: (a) majority voting, where the final identity of the cell-types is chosen from the results of several algorithm runs with random initializations, and (b) usage of classes, where several input reference signatures may be grouped into one “class” of cell-type. These adjustments were added to the algorithm to improve separation capabilities of cell-types with similar signatures and increase the algorithm's robustness to noisy reference signatures.
The following linear model is widely used for separation of gene expression:
$$M_{ij} = \sum_{k=1}^{k_{CT}} G_{ik}\, C_{kj}, \qquad i = 1, \dots, m, \quad j = 1, \dots, n \qquad (1)$$
where Mij is the entry of the mixed expression matrix for gene i in sample j, Gik is the entry of the separated cell-type specific gene-expression matrix for gene i in cell type k and Ckj is the relative proportion of cell type k in sample j; kCT is the total number of cell-types in the tissue, and m and n are the total number of genes and samples, respectively. Studies based on the model in (1) have shown that separation of mixed data with known proportions yielded cell-type specific expression estimates that were highly correlated with the corresponding purified cell gene-expression, rendering the linearity assumption acceptable. All currently existing approaches, whether they use the linear model or not, require some a priori information about the tissue analyzed, such as the number of cell types, their identity or their relative proportions in each sample. We are interested in estimating G and C from the observation M, without explicit a priori knowledge of the number of cell-types in the tissue, kCT, or their identities (note that we will use upper-case boldface letters to denote matrices and lower-case boldface letters to denote vectors). Rather, we consider a collection of kmax cell-types representing all possible cell types assumed to comprise the analyzed tissue. This is a hypothesis-testing problem, where each possible combination of cell-types is a hypothesis. Our objective is to choose the correct hypothesis, i.e., to determine which cell-types exist in the analyzed tissue. Assume that T is a label of a cell-type. We begin with a collection of labels {T′1, . . . , T′k
The relation to Hyper Spectral Imaging
Separation of gene-expression can be viewed as a special case of a more general class of problems known as Nonnegative Matrix Factorization (NMF) problems, defined as follows: given a nonnegative data matrix M, find the smallest dimension matrices G and C with non-negative entries such that
M≈GC (2)
where G is referred to as the end-members matrix (end-members are the classes of materials that compose the object M) and C represents the relative proportions in which the end-members are mixed in M. G's ith column represents the signature of the ith end-member, and C's kjth entry represents the relative proportion of the kth end-member in the jth data vector mj. This is equivalent to writing (1) in matrix form, where each data vector mj represents the microarray measurements of sample j. Each cell type is an end-member, so G's ith column represents the gene signature of the ith cell-type, and the jth column of C represents the relative proportions of the cell types (whose signatures comprise the columns of G) in sample j. If the number of cell types is smaller than the number of samples, the dimensions of G and C are smaller than the dimension of M, and the problem in (1) is a special case of the problem in (2).
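The equivalence between the element-wise model (1) and the matrix form (2) can be illustrated with a minimal sketch. The toy matrices below are hypothetical values chosen for illustration only (4 genes, 2 cell types, 3 samples), and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical toy data: 4 genes, 2 cell types, 3 samples.
G = np.array([[10.0, 1.0],
              [2.0, 8.0],
              [5.0, 5.0],
              [0.0, 4.0]])       # m x k_CT: one signature column per cell type
C = np.array([[0.7, 0.5, 0.1],
              [0.3, 0.5, 0.9]])  # k_CT x n: column j holds the proportions in sample j

# Matrix form of the linear model: M = G C
M = G @ C

# Element-wise form: M_ij = sum_k G_ik * C_kj
M_loop = np.zeros((G.shape[0], C.shape[1]))
for i in range(G.shape[0]):
    for j in range(C.shape[1]):
        M_loop[i, j] = sum(G[i, k] * C[k, j] for k in range(G.shape[1]))
```

Both computations yield the same mixed matrix, which is why the separation task can be posed as a factorization of M.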
The algorithm proposed herein is an adaptation of an NMF algorithm by Piper et al. (Object Characterization from spectral data using nonnegative factorization and information theory; 2004; Maui, HI) that was designed for spectral analysis of space objects. Piper et al. studied the problem of identification and classification of space objects whose orbits are significantly distant (e.g., geosynchronous satellites) or whose dimensions are small (e.g., nanosatellites) from ground-based telescope spectral measurements. In their problem, an object is classified by determining the characteristics of the materials that make up its spectral trace. Each data vector mj represents a spectral trace (i.e., the spectral image) of the jth object. G's ith column represents the spectral signature of the ith material in the object (end-member). Piper et al.'s hyper-spectral analysis approach is useful for analysis of gene-expression microarrays due to its use of prior knowledge. Their method uses a stored set of laboratory-obtained spectral signatures of space object materials, obtained in a different experiment, to determine the number of end-members. These stored signatures are not necessarily identical to the underlying signatures but are only close to them. This approach is very appealing for separation of gene-expression, since in most cases the purified cell-types are not separated and profiled separately in the same experiment. Furthermore, it is possible to obtain cell-type specific reference signatures and use them for any analysis involving similar cell types. Despite the similarity between the two NMF applications, i.e., gene-expression analysis and spectral analysis, extensions to Piper et al.'s algorithm designed for spectral analysis were needed for the gene-expression analysis, as described in the following section.
Our algorithm includes three major parts. In the first part, we obtain an initial estimate of the matrix G using kmax as the number of columns. In the second part, we estimate the true number of cell-types, k̂CT, their identities, and the cell-type expression signatures matrix Ĝ. In the final part, we compute the cell-type proportions matrix C.
Initialization: The algorithm receives as input: (a) an m×n matrix M, and (b) an m×kmax matrix L, where M is the mixed matrix to be separated with m genes and n samples and L is the reference signatures matrix with m genes and kmax columns. Both M and L have non-negative entries and are normalized such that each column sums to its mean. The matrices H and W, which represent intermediate estimates of the C and G matrices, are initialized as follows. The entries Hkj, 1≤k≤kmax, 1≤j≤n, are realized values of independent random variables, uniformly distributed from zero to one. The matrix W is initialized with the reference signatures matrix L and the columns of W are scaled to sum to one.
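The initialization step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `initialize` and the `seed` parameter are hypothetical, and every reference column is assumed to have a non-zero sum.

```python
import numpy as np

def initialize(L, n_samples, seed=None):
    """Return (W, H): W from the reference signatures L, H drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    k_max = L.shape[1]
    # H: k_max x n, independent entries uniform on [0, 1)
    H = rng.uniform(0.0, 1.0, size=(k_max, n_samples))
    # W: reference signatures with each column scaled to sum to one
    # (assumes no all-zero reference column)
    W = L / L.sum(axis=0, keepdims=True)
    return W, H
```

Calling `initialize` with different seeds gives the different random starting points for H that repeated runs rely on, while W always starts from the same reference signatures.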
Evaluation of H and W: In the first stage, the algorithm receives the matrix M and the integer kmax as inputs and outputs H and W such that
M≈WH, (3)
using NMF [13]; i.e., H and W minimize ∥M−WH∥F, where ∥·∥F is the Frobenius norm (the square root of the sum of squares of the entries of the matrix), under the constraint that H and W have non-negative entries and the columns of W sum to one. The matrices W and H serve as intermediate representations of the matrices G and C, respectively.
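The text cites an NMF routine without spelling out its update rule. The sketch below assumes the standard multiplicative updates for the Frobenius objective, which may differ from the routine actually used; the names `nmf`, `n_iter` and `eps` are hypothetical. After each iteration the columns of W are rescaled to sum to one and the rows of H are rescaled inversely, so the product WH is unchanged.

```python
import numpy as np

def nmf(M, W0, H0, n_iter=500, eps=1e-12):
    """Approximate M by W @ H with non-negative W, H (multiplicative updates)."""
    W = np.array(W0, dtype=float)  # copies, so the inputs are not modified
    H = np.array(H0, dtype=float)
    for _ in range(n_iter):
        # Multiplicative updates keep non-negative entries non-negative.
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
        # Rescale so each column of W sums to one; compensate in H so W @ H is preserved.
        scale = W.sum(axis=0) + eps
        W /= scale
        H *= scale[:, None]
    return W, H
```

Because the objective is not convex, the result depends on the starting point, which is why the initialization of W from the reference signatures matters.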
Estimation of kCT and G: The true number of cell-types in M, kCT, is estimated by:
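$$\hat{k}_{CT} = \left| \left\{ \arg\min_{w \in \{w_1, \dots, w_{k_{max}}\}} D(w, l_i) \;:\; i = 1, \dots, k_{max} \right\} \right| \qquad (4)$$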
Recall that kmax is greater than the true number of cell-types kCT; thus some of the columns of the matrix W are redundant. Each column in L, li, is associated with the column in W, wi, to which it has the minimal distance, D. Here, SKLD is used, as in Piper et al. The SKLD is defined as follows: let w and d be two signatures and let p=w/Σiwi and q=d/Σidi; then the SKLD is defined as
$$D_s(w, d) = D(w, d) + D(d, w) \qquad (5)$$
where D(w, d)=Σi pi log(pi/qi). We have also run the algorithm using Euclidean distance and correlation as the distance measures; however, the results were not as accurate as with SKLD. This may be explained by the fact that the SKLD is well suited to the NMF used in our algorithm, as it only considers arguments with positive values. The estimated number of cell types, k̂CT, is set to the number of chosen columns in W. Note that it is possible that some of the columns in W will not be chosen. The cell type identity of each of the chosen wi columns is determined according to its corresponding li column. In cases where more than one column in L is associated with a certain wi, the identity of that wi is determined according to the li from which it has the minimal SKLD. The estimated G matrix, Ĝ, is then constituted from the chosen columns of W.
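The selection step can be sketched in plain Python as follows. The function names, the small `eps` floor that keeps the logarithm finite at zero entries, and the toy signatures are all assumptions made for illustration, not details from the source.

```python
import math

def kl(p, q):
    # D(p, q) = sum_i p_i * log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def skld(w, d, eps=1e-12):
    # eps is an assumed safeguard against log(0); it is not part of the source definition.
    w = [x + eps for x in w]
    d = [x + eps for x in d]
    sw, sd = sum(w), sum(d)
    p = [x / sw for x in w]
    q = [x / sd for x in d]
    return kl(p, q) + kl(q, p)

def select_columns(W_cols, L_cols, labels):
    """Map each chosen W column index to the label of its closest reference."""
    best = {}  # index of chosen W column -> (distance, label)
    for l_col, label in zip(L_cols, labels):
        dists = [skld(w_col, l_col) for w_col in W_cols]
        i = min(range(len(W_cols)), key=dists.__getitem__)
        # If several references pick the same W column, keep the closest one.
        if i not in best or dists[i] < best[i][0]:
            best[i] = (dists[i], label)
    return {i: label for i, (dist, label) in sorted(best.items())}

# Hypothetical toy input: three estimated columns, three reference signatures.
W_cols = [[0.70, 0.20, 0.10], [0.10, 0.20, 0.70], [0.34, 0.33, 0.33]]
L_cols = [[0.68, 0.22, 0.10], [0.10, 0.22, 0.68], [0.65, 0.25, 0.10]]
chosen = select_columns(W_cols, L_cols, ["liver", "brain", "heart"])
k_hat = len(chosen)  # estimated number of cell types
```

In this toy input two references compete for the first column of W and the closer one supplies its label, while the third column is matched by no reference and is therefore dropped.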
Estimation of C: The estimate of the matrix C, Ĉ, is obtained by using NNLS using Ĝ and M, such that
M≈ĜĈ, (6)
under the constraint that the entries of Ĉ are greater than or equal to zero. Finally, the rows of Ĉ are normalized to 1 to represent cell-type proportions. The output of the algorithm is the matrices Ĉ and Ĝ, representing the proportions of each cell type in each sample and the specific gene expression for each separated cell type, respectively. Pseudo code of the algorithm is given in Example 1.
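A minimal sketch of the proportion estimate, assuming SciPy's `scipy.optimize.nnls` as the NNLS solver (the source does not name an implementation). Each sample is solved independently; the function name `estimate_proportions` is hypothetical. The normalization here makes the proportions of each sample (each column of Ĉ, following the definition of C given with equation (2)) sum to one, which is an assumption about the intended orientation.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_proportions(M, G_hat):
    """Solve M[:, j] ~ G_hat @ c_j with c_j >= 0 for every sample j."""
    k, n = G_hat.shape[1], M.shape[1]
    C_hat = np.zeros((k, n))
    for j in range(n):
        C_hat[:, j], _ = nnls(G_hat, M[:, j])  # returns (solution, residual norm)
    # Assumed orientation: normalize each sample's coefficients to sum to one.
    totals = C_hat.sum(axis=0, keepdims=True)
    totals[totals == 0] = 1.0  # leave all-zero columns untouched
    return C_hat / totals
```

Solving one sample at a time keeps the sketch simple; the samples do not interact, so the columns of Ĉ can be computed in any order.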
Majority voting: The NMF algorithm used to evaluate H and W is not guaranteed to converge to a global minimum, as NMF is not a convex optimization problem. This problem is most significant in cases where the cell-types have similar signatures (e.g., immune cell subsets as in the T-B-Monocytes dataset). To overcome this problem, we have initialized the W matrix with the input signatures matrix L and, in addition, set an option to run the algorithm several times using random initializations of H. Each run yields an estimate Ĝ in which each column represents a cell-type that was chosen by the algorithm. A cell-type is included in the final estimate of G, Ĝ, if the percentage of runs in which it was chosen, out of the total number of runs, exceeds a given threshold. The estimated gene expression of each chosen cell-type is set to the average of the gene-expression estimates of this cell-type over all runs in which it was chosen. The final estimate of the number of cell-types, k̂CT, is set to the number of columns of the final Ĝ matrix.
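The voting rule can be sketched as follows, assuming each run is summarized as a mapping from a chosen cell-type label to its estimated signature. The function name `majority_vote`, the data layout and the toy values are hypothetical.

```python
from collections import defaultdict

def majority_vote(runs, threshold=0.7):
    """runs: list of dicts {cell-type label: estimated signature (list of floats)}."""
    collected = defaultdict(list)
    for run in runs:
        for label, signature in run.items():
            collected[label].append(signature)
    final = {}
    for label, signatures in collected.items():
        # Keep a cell type only if it was chosen in more than `threshold` of the runs.
        if len(signatures) / len(runs) > threshold:
            # Average the estimates over the runs in which it was chosen.
            final[label] = [sum(values) / len(signatures) for values in zip(*signatures)]
    return final

# Hypothetical example: 10 runs; "heart" is chosen in only 2 of them.
runs = [{"liver": [1.0, 3.0], "brain": [2.0, 2.0]}] * 8 \
     + [{"liver": [3.0, 1.0], "heart": [5.0, 5.0]}] * 2
final = majority_vote(runs, threshold=0.7)
```

With a 70% threshold, a cell type chosen in 8 of 10 runs is retained and one chosen in 2 of 10 runs is discarded; the number of keys in `final` is then the final estimate of the number of cell types.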
Classes: The algorithm utilizes the reference signatures matrix L. As such, it is sensitive to the signatures provided by the user and may fail to accurately separate cell-types in cases where the cell-types are very similar or if the user is missing a priori information regarding the tissue to be separated, e.g. the exact nature of the cell type, tissue or experimental conditions. To improve performance in such cases, we have allowed for reference signatures to be grouped into classes with a single label. For example, to separate colorectal tumor cells of an unknown subtype from a mixed tissue, reference signatures for several colorectal tumor types representing different tumor subtypes may be provided and will constitute the class “colorectal tumor.” An additional example for using classes includes unifying several signatures for one cell type taken from different studies, e.g. purified heart cells from two different studies, under the class “heart.” This allows us to use more than one signature for each cell-type, which increases the robustness of the algorithm in cases where the reference signatures are noisy. The algorithm first estimates Ĝ as if there are no classes. Then, all W columns associated with the same class are averaged and labeled according to that class.
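The class step can be sketched as a simple grouping and averaging of the chosen columns. The function name `merge_classes`, the label names and the toy signatures are hypothetical; labels without an assigned class are assumed to form a class of their own.

```python
def merge_classes(chosen, class_of):
    """Average the signatures of all chosen columns that share a class.

    chosen:   dict {reference label: signature (list of floats)}
    class_of: dict {reference label: class name}; unlisted labels keep their own name
    """
    groups = {}
    for label, signature in chosen.items():
        groups.setdefault(class_of.get(label, label), []).append(signature)
    return {
        cls: [sum(values) / len(signatures) for values in zip(*signatures)]
        for cls, signatures in groups.items()
    }

# Hypothetical example: two heart signatures from different studies form one class.
chosen_cols = {
    "heart_study1": [1.0, 2.0],
    "heart_study2": [3.0, 4.0],
    "brain": [5.0, 6.0],
}
merged = merge_classes(chosen_cols, {"heart_study1": "heart", "heart_study2": "heart"})
```

Averaging within a class means that a single noisy reference signature has less influence on the final profile for that cell type.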
To separate a heterogeneous tissue, the user should have some knowledge regarding the nature of the tissue that is being separated and its possible cell-type constituents. Purified signatures of the candidate cell-types may be found in public repositories such as GEO via a simple search for the required cell-type and species. The chosen signatures need not be from the same disease, tissue, study or experiment as the heterogeneous tissue to be separated. In case there are many possible relevant options from different studies for a cell-type, one can input several signatures of the same cell-type to the algorithm and gather them under the same class. This was demonstrated in the heart-brain and T-B-Monocytes dataset examples. The limit on the total number of signatures used for all cell-types is the number of samples of the mixed tissue that is being separated, as explained under “Linear model for separation of gene-expression.” In case of uncertainty as to what cell-types constitute the tissue, one does not have to be precise and can over-guess by inputting many, even unrelated, cell-types into the algorithm. Note that under-guessing the number of cell-types may cause ambiguities in the algorithm results, as explained above.
Parameter settings. Parameters concerning majority voting (threshold, number of majority voting runs) and classes were set according to the nature of the cell-type signatures in each dataset, based on trial and error and common sense. In the case of majority voting, the more the input reference signatures are similar (such as in the T-B-Monocytes dataset (Abbas AR, et al. (2009) PLoS One 4: e6098), see also
For classes' parameters, we observed that the algorithm encounters difficulties in separating cell-types for which the input reference signatures are very similar. In such instances, one might consider unifying these signatures under one class (where biologically relevant) or seek reference signatures from a different source. Observation of the input reference signatures, e.g., by drawing their heatmaps (
We tested the algorithm on three publicly available datasets in which known proportions of known cell types were mixed and their gene-expression was measured.
The algorithm was run with the following parameters for each dataset: (1) liver-brain-lung dataset (Shen-Orr SS, et al. (2010) Nat Methods 7: 287-289): majority voting threshold=70%, majority voting runs=10, classes=none. (2) Heart-brain dataset (http://www.affymetrix.com/support/technical/sample_data/gene_1_0_array_data.affx): majority voting threshold=70%, majority voting runs=10, classes=unifying the two brain and two heart cell types to the classes “brain” and “heart,” respectively. (3) T-B-Monocytes dataset (Abbas AR, et al. (2009) PLoS One 4:e6098): majority voting threshold=70%, majority voting runs=20, classes=unifying the two B cell line types to the class “B cells.”
Microarray data. All microarray data were downloaded from GEO (Barrett T, et al. (2009) Nucleic Acids Res 37: D885-890) as raw .CEL files and RMA-normalized using the R package “affy.” The datasets and reference signatures for each analysis were jointly quantile-normalized using the R package “limma.” The following accession numbers were used for each dataset: (1) Liver-brain-lung dataset (Shen-Orr SS, et al. (2010) Nat Methods 7: 287-289) (GSE19830), with reference signatures of purified rat liver (GSE8252), brain (GSE3428), lung (GSE16849), intestine (GSE16849), heart (GSE5085) and granulosa (GSE13883) cells. All reference signatures were chosen from the same platform as the analyzed data: Affymetrix Rat Genome 230 2.0 Array. (2) Heart-brain dataset (http://www.affymetrix.com/support/technical/sample_data/gene_1_0_array_data.affx), with reference signatures of purified human myocardial (heart) cells from two different studies (GSE21610, GSE29819), brain cells from the entorhinal cortex (GSE4757) and grey matter (GSE28146), oocytes (GSE12034) and hepatocytes (GSE31264). All reference signatures were from the same platform as the analyzed data: Human Genome U133 Plus 2.0 Array. (3) T-B-Monocytes dataset (Abbas AR, et al. (2009) PLoS One 4: e6098) (GSE11058), with reference signatures of purified T cell Jurkat (GSE7508, GSE30678), Monocyte THP-1 (GSE26868), B cell Raji (GSE12278, GSE13210) and IM-9 (GSE24147), IMC-1 NK (GSE19067) and MCF-10A epithelial (GSE10196) cell-lines. All reference signatures were from the same platform as the analyzed data: Affymetrix Human Genome U133 Plus 2.0 Array.
The liver-brain-lung dataset includes samples of rat liver, brain and lung cell mixtures (Shen-Orr SS, et al. (2010) Nat Methods 7: 287-289). The purified cell-type reference signatures were collected from GEO and included rat liver, brain, lung, intestine, heart and granulosa cell gene-expression profiles from different studies (see “microarray data” below;
The heart-brain dataset includes samples of heart and brain human cell mixtures (http://www.affymetrix.com/support/technical/sample_data/gene_1_0_array_data.affx). Purified cell reference signatures were collected from GEO and included myocardial (heart) cells, brain cells from the entorhinal cortex and grey matter, oocytes and hepatocytes from different studies (see “microarray data” in methods section;
To test separation of cell-types with similar signatures, we chose the T-B-Monocytes dataset, containing mixtures of T, Monocyte and two types of B cell lines (Abbas AR, et al. (2009) PLoS One 4: e6098). Purified cell reference signatures collected from GEO included human immune cell lines of T-cells, B-cells, Monocytes, NK cells and epithelial cells (see “microarray data” below;
We tested the algorithm on a semi-controlled dataset of prostate cancer in which cell-type proportions were estimated by a pathologist (Wang Y, et al. (2010) Cancer Res 70: 6448-6455). The cell-types in the analyzed tissue were carcinoma, benign (BPHE) and dilated (DCAE) epithelial and stromal cells. Purified cell signatures of prostate tumor cell lines, benign prostate cells, normal prostate epithelial cells, stroma surrounding invasive prostate tumors and normal stroma were collected from GEO (see “microarray data” below;
Microarray data. All microarray data were downloaded from GEO (Barrett T, et al. (2009) Nucleic Acids Res 37: D885-890) as raw .CEL files and RMA-normalized using the R package “affy.” The datasets and reference signatures for each analysis were jointly quantile-normalized using the R package “limma.” The prostate cancer dataset (Wang Y, et al. (2010) Cancer Res 70: 6448-6455) (GSE17951) included 154 patient samples, with proportions of the tumor cells available for 137 samples. Reference signatures included purified prostate tumor cell lines DU145, PC3, CWR22Rv, LAPC4, C42B and LNCaP (GSE12348), benign prostate cells (GSE3325), normal prostate epithelial cells (GSE9951), and stroma surrounding invasive prostate primary tumors and normal stroma (GSE26910). All reference signatures and analyzed data were taken from two similar platforms: Affymetrix Human Genome U133A Array and U133 Plus 2.0 Array.
A modified version of the separation algorithm without the cell-type determination step was run on the three cell-types liver-brain-lung dataset, using six, five and four reference cell-type signatures mined from GEO. For algorithms that do not have a cell-type determination mechanism, such an over-fit (addition of extra cell-types) would be insignificant if the resulting proportions of the additional cell-types were close to zero. However, this example shows that this is not the case: the over-fit significantly degraded the performance of the algorithm. Hence the cell-type determination step is crucial.
A modified version of the separation algorithm without the cell-type determination step was run on the four cell-types T-B-Monocytes dataset (which includes two different B cell line types), using six and five reference cell-type signatures mined from GEO. For algorithms that do not have a cell-type determination mechanism, such an over-fit (addition of extra cell-types) would be insignificant if the resulting proportions of the additional cell-types were close to zero. However, this example shows that this is not the case: the over-fit significantly degraded the performance of the algorithm. Hence the cell-type determination step is crucial.
In this example we used an NNLS (non-negative least squares)-based algorithm, which served as a benchmark for most NNLS-based separation algorithms. The reference signatures were used to extract the proportions matrix. This algorithm was run on the three cell-types liver-brain-lung dataset, using six, five and four reference cell-type signatures mined from GEO. For NNLS-based algorithms that do not require any prior information, an over-fit (i.e., assuming that there are more cell-types than actually exist) would be insignificant if the resulting proportions of the additional cell-types were close to zero. However, this example shows that this is not the case: the over-fit significantly degraded the performance of the algorithm. Hence the cell-type determination step and the usage of NNMF (non-negative matrix factorization) are crucial.
In this example we used an NNLS (non-negative least squares)-based algorithm, which served as a benchmark for most NNLS-based separation algorithms. The reference signatures were used to extract the proportions matrix. This algorithm was run on the four cell-types T-B-Monocytes dataset (which includes two different B cell line types), using six and five reference cell-type signatures mined from GEO. For NNLS-based algorithms that do not require any prior information, an over-fit (i.e., assuming that there are more cell-types than actually exist) would be insignificant if the resulting proportions of the additional cell-types were close to zero. However, this example shows that this is not the case: the over-fit significantly degraded the performance of the algorithm. Hence the cell-type determination step and the usage of NNMF (non-negative matrix factorization) are crucial.
Cells highlighted in orange show the real proportions (where T-Jurkat=CT1, BRaji/IM9=CT2, Monocytes=CT4); cells highlighted in grey are the cell-types which were mistakenly assumed to be present but were not removed because the cell-type determination step was not included here, as in our complete algorithm.
Gene-expression analyses of whole tissues, which are heterogeneous in nature and consist of a mixture of several cell-types, are utilized extensively and are highly abundant in public repositories such as GEO (Barrett T, et al. (2009) Nucleic Acids Res 37: D885-890). However, it is now becoming clear that the identity, composition and profiles of individual cell-types are extremely important to the process of unraveling the biology of each cell-type population and the interplay between the populations in both healthy and disease states. Due to the expense and difficulty of separating them, only a limited number of studies profile and analyze individual cell-types. More importantly, public repositories are replete with existing data of whole tissues including thousands of patients, treatments, tissues and cell-types. This rich trove of data is from experiments that may never be repeated using such a large patient pool or such experimental conditions. Our techniques can realize the great potential of these data, which contain much information about the constituent individual cell-types in heterogeneous tissues that, to date, has not been fully interrogated.
Computational methods have been developed to allow the separation of heterogeneous tissues into their cell-type constituent profiles and/or relative proportions. However, all currently existing separation methods require that the number of cell-types in the tissue, their identity, or their relative proportions in the analyzed tissue are known. Such information rarely exists, as most profiling studies do not purify the cell-types in the tissue, extract their proportions or verify their identity, rendering the existing separation methods non-usable for most existing datasets; rather, these datasets are usable only in experiments designed in advance to allow for the separation technique.
We have developed a separation method that requires no a priori information about the tissue analyzed other than an initial rough estimate of the cell-types that may exist in the tissue samples analyzed. This is a reasonable input to ask for and relatively easy to find, as information regarding the composition of most tissues is readily available in the literature and public databases such as GEO are replete with many types of purified cell-types from various experiments. As our algorithm does not require the purified cell-type profiles to be disease, tissue or even study-specific, one can simply use any relevant purified profile as an input to the algorithm. These properties render our algorithm the only useful method to separate most publicly available heterogeneous microarray datasets.
We successfully applied our separation technique to three controlled datasets with known proportions and cell types, in addition to a semi-controlled dataset where cell-type proportions per sample were estimated by a pathologist, to test the method on a dataset that resembles the heterogeneous datasets available in the literature rather than on datasets specifically engineered for separation. Our blind separation technique accurately extracted the relative cell-type proportions per sample and their separated gene-expression signatures and performed as well as, and for some cell-types even better than, other reported separation techniques that require different types of input information about the dataset analyzed to be available. Most importantly, our technique successfully identified the number of cell-types in the tissues analyzed and their identities. These features are not included in any of the reported separation techniques, and are in fact considered an integral input for the usage of these techniques. It is these features that are mostly unavailable for publicly available datasets, or any dataset in which they have not been experimentally identified. In addition, the cell-type populations and proportions in a tissue are not always consistent amongst different individuals, which renders the identification of those populations and their identities crucial.
The algorithm's robustness to varying input signatures was demonstrated by using additional cell-type signatures that were not related to the analyzed tissue as input to each controlled dataset (e.g., the intestine, heart and granulosa cell-types were input to the liver-brain-lung dataset). To address the algorithm's robustness to signatures of different qualities, signatures from different studies were used for the same cell-type and gathered under the same class (e.g., two T-cell Jurkat and B-cell Raji cell-lines from different studies were input to the T-B-Monocytes dataset). The algorithm identified the correct number of cell-types and their correct identities in all examples. In general, the algorithm performed better when separating cell-types that were very different from one another, as in the heart-brain dataset, compared to cell-types that were very similar to each other, such as in the T-B-Monocytes dataset. However, in the latter example, given that no a priori data about the mixed tissue was provided, the algorithm still yielded accurate results. In particular, the algorithm identified all three cell-types (T, B and Monocytes) with an error that was close to that reported by the original study, where the number of cell-types, their identity and their true gene-expression profiles were given as an input. Moreover, the algorithm also successfully separated the two B cell-lines, cell-types with an almost identical gene expression. A comparison of the true purified signatures from the same study with the input signatures mined from GEO and the resulting signatures inferred by our algorithm showed that, in each of the datasets explored, the resulting signatures were always closer to the true signatures than the signatures from GEO, demonstrating that our algorithm successfully identifies the input signatures close to the true ones.
Compared to existing algorithms, our algorithm yielded at least comparable results, and in some cases better results (such as predicting the lung cell-type in the liver-brain-lung dataset). An important distinction is that our algorithm does not require the a priori information required in existing algorithms and, in contrast with those algorithms, it is able to determine the number of cell-types in the heterogeneous tissue and their identities. To demonstrate the importance of this added capability, we compared the performance of our algorithm to an NNMF approach, without the cell-type determination step, which is initialized in the same manner as our algorithm (
In summary, our blind separation technique successfully identifies the cell-type composition in heterogeneous gene-expression data, and provides high accuracy estimates of cell-type specific signatures and their relative proportions per sample. The only information the algorithm requires is an initial estimate of the cell-types that may exist in the tissue analyzed and their signatures, which can be easily found in public databases such as GEO. This method is especially advantageous for re-analyzing existing microarray data for which no additional information is available, allowing re-examination and extraction of information for individual cell-type populations while taking advantage of already-existing, large-scale microarray datasets.
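The relationship between signatures and proportions can be illustrated with a minimal sketch. It assumes the linear mixing model, in which a mixed sample is the sum of cell-type signatures weighted by their proportions, and it treats the signatures as fixed and known. It is therefore not the blind separation algorithm itself, only the proportion-estimation step in isolation, solved here with an NMF-style multiplicative update. The function name `estimate_proportions` and the toy values are hypothetical.

```python
def estimate_proportions(signatures, mixed, n_iter=200):
    """Estimate non-negative cell-type proportions for one mixed sample.

    signatures: list of cell-type signatures, each a list of expression
                values over the same genes (assumed non-negative).
    mixed:      expression values of the mixed sample over those genes.
    Assumes mixed ~= sum_k proportion_k * signature_k (linear model).
    """
    k = len(signatures)
    genes = range(len(mixed))
    # b[i] = <signature_i, mixed>, a[i][j] = <signature_i, signature_j>
    b = [sum(s[g] * mixed[g] for g in genes) for s in signatures]
    a = [[sum(si[g] * sj[g] for g in genes) for sj in signatures]
         for si in signatures]
    c = [1.0 / k] * k  # start from equal proportions
    for _ in range(n_iter):
        denom = [sum(a[i][j] * c[j] for j in range(k)) for i in range(k)]
        # Multiplicative update keeps every coefficient non-negative.
        c = [c[i] * b[i] / denom[i] if denom[i] > 0 else 0.0
             for i in range(k)]
    total = sum(c)
    return [x / total for x in c]  # rescale so proportions sum to 1


# Hypothetical toy data: two cell-type signatures over four genes,
# mixed at 70% / 30%.
sig_a = [10.0, 1.0, 5.0, 2.0]
sig_b = [1.0, 8.0, 2.0, 6.0]
mixed_sample = [0.7 * x + 0.3 * y for x, y in zip(sig_a, sig_b)]

props = estimate_proportions([sig_a, sig_b], mixed_sample)
print(props)  # close to [0.7, 0.3]
```

In the blind setting described above, the signatures themselves are also updated and the set of cell-types present is determined by the algorithm; this sketch covers neither of those steps.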
The separation algorithm was applied to publicly available datasets collected from the Gene Expression Omnibus (GEO), as described and exemplified below.
Application of separation algorithm: 1) Download heterogeneous tissues dataset from GEO, normalize and annotate. This will be referred to as “Mix.” In this example, 26 different datasets of patients' primary breast cancer heterogeneous tumor tissue were downloaded, with a total of 4,505 samples with various clinical parameters (Table 2).
Platforms GPL570 and GPL96 refer to Affymetrix Human Genome U133 Plus 2.0 and U133A arrays, respectively. Number of arrays refers to the number of patient samples in each dataset.
2) Collect relevant reference signatures (purified cell types of cells suspected to exist in analyzed tissue) from GEO, normalize and annotate. This will be referred to as “Signatures.” In this example, 47 different purified cell type reference signatures from 30 different datasets were used. Reference signatures underwent quality control to check whether similar cell types cluster together (via hierarchical clustering of the correlations of the input reference signatures,
3) Prepare input files for separation algorithm: normalize Mix and Signatures together (e.g. using quantile normalization) and print to input files; prepare additional input files such as cell type class annotations. In this example, the class annotation was set such that the algorithm will aim to separate the heterogeneous tissues into 5 main cell type classes: tumor cells, T cells, B cells, Myeloid cells (including Macrophages, Dendritic cells, Monocytes and Neutrophils) and non-immune stromal cells (NIS, including fibroblasts, mesenchymal stem cells and adipocytes).
4) Run separation algorithm. The output of the algorithm includes the following files: CT—cell types identified, Gres—estimated separated cell type gene expression signatures, Cres—estimated cell type proportions per sample. In this example, the algorithm was applied separately to each dataset. Thus, the outputs CT, Gres and Cres were generated for each of the 26 datasets, but analyzed collectively.
5) Analyze algorithm output, for example by correlating Cres with patient clinical parameters and by gene-expression and pathway analysis of Gres. In this example, we calculated overall average numbers of cell types in breast cancer primary tumors (
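Step 3 above normalizes Mix and Signatures together, with quantile normalization given as one option. The following is a minimal sketch of joint quantile normalization in plain Python, offered as an illustration rather than as the implementation used here. The function name `quantile_normalize`, the toy expression values, and the simplified tie handling are assumptions.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length columns (one column per array).

    After normalization every column contains the same set of values
    (the across-column mean at each rank), arranged according to that
    column's original ordering. Ties are broken by position, which is
    a simplification of the usual tie-averaging rule.
    """
    n_rows = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean expression at each rank, taken across all arrays.
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns)
                  for r in range(n_rows)]
    normalized = []
    for col in columns:
        order = sorted(range(n_rows), key=lambda i: col[i])
        out = [0.0] * n_rows
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: two mixed samples and two reference
# signatures, four genes each.
mix = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0]]
signatures = [[3.0, 4.0, 6.0, 8.0], [10.0, 2.0, 7.0, 1.0]]

# Normalize Mix and Signatures jointly, then split them apart again.
joint = quantile_normalize(mix + signatures)
mix_norm, sig_norm = joint[:len(mix)], joint[len(mix):]
```

Normalizing the two sets jointly places the mixed samples and the reference signatures on a common distribution before they are written to the input files.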
This application claims priority from provisional application 61/856,271, filed Jul. 19, 2013, the contents of which are incorporated herein by reference in their entirety.
This invention was made with government support under contract N000140910072P00006 awarded by the Office of Naval Research; under contract FA 9550-08-1-0480 awarded by the Air Force Office of Scientific Research; and under contract HDTRA1-08-1-0010 awarded by the Defense Threat Reduction Agency. The government has certain rights in this invention.
Number | Date | Country |
---|---|---|
61856271 | Jul 2013 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 14336434 | Jul 2014 | US |
Child | 16930012 | | US |