Embodiments of the present invention relate generally to the use of gene expression data, and in particular to the use of gene expression data across different profiling platforms.
The dynamic ranges of gene expressions can vary considerably depending on the choice of profiling platform. As a result, prognostic gene signatures are usually platform-specific. In general, expression data generated by heterogeneous platforms cannot be directly combined for computational analysis, thus limiting the use of legacy data and hindering the adoption of new profiling technologies. More specifically, it can be difficult to transfer the knowledge and insight gained from the tremendous amount of legacy microarray studies onto new platforms such as Next Generation Sequencing (NGS) systems.
A number of methods have been proposed to handle cross-platform compatibility of expression data. One approach involves mapping probes/reads to common genomic targets, then calling platform-level expression (RMA for microarray and RPKM for RNA-Seq) for each target, and finally applying quantile normalization, assuming that the expression distributions across platforms only differ by a sample-specific scaling factor. Another approach involves the application of gene-by-gene factor analysis to obtain a unified expression measure from multiple platforms using expectation-maximization (EM) algorithms. Yet another approach uses a system of functional measurement error models to model gene expression measurements and calibrate platforms using the allegedly more reliable but low-throughput qRT-PCR expressions for a subset of the genes. Like factor analysis, however, the models may only hold for an expression range that fits all three platforms and genes with extreme expressions are excluded. Still another approach involves counting the number of reads overlapping with probe regions in RNA-Seq data, estimating probe-region expressions using an empirical Bayesian approach, and subsequently applying a modified RMA algorithm (i.e., without the background correction step) on the probe-region expressions to obtain gene-level expressions. This method, however, involves more complicated computations on the mapped reads and is rigid in terms of the choice of platforms, i.e. RNA-Seq for input and RMA for output.
Given the limitations of these prior approaches, it would be desirable to have a generalized approach that supports the transformations of measurements from one gene expression platform to another.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description section. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Aspects of the present invention relate to a data-driven generalized regression-based framework that supports the transformation of measurements, applicable but not limited to gene expressions, from one platform to another over a wide dynamic range, with selected summary statistics/feature values as predictors for the model parameters. The framework consists of primary model training and transformation, and additional levels of categorical regression and transformation processes.
Embodiments of the present invention eliminate the unnecessary re-profiling of samples for the sake of combined analysis, solve the backward compatibility problem, and facilitate the adoption of new profiling technologies by allowing legacy data to be readily transformed for use with data from newer platforms. Furthermore, platform-specific gene signatures can be extended for use on the expression data from multiple platforms, either by transforming the input data to the primary platform or by adapting the parameters of a signature to the alternative platforms.
According to one aspect of the disclosure, embodiments of the present invention relate to a method for transforming gene expression data. In some embodiments, the method includes constructing a primary model utilizing sample expression data for transforming gene expression data from a first profiling platform to a second profiling platform such that the overall distribution of the transformed data resembles that of the second platform.
In some embodiments, constructing the primary model includes identifying at least one common expression between a first set of nucleic acid expression data derived using a first profiling platform and a second set of nucleic acid expression data derived using a second profiling platform, with each common expression associated with a sample present in both the first set and second set. In some embodiments, constructing the model includes performing regression analysis on the at least one common expression, resulting in one set of regression parameters for each sample. In some embodiments, constructing the model includes selecting at least one candidate feature from the first profiling platform that predicts the at least one set of regression parameters. In some embodiments, constructing the model includes identifying a primary model for sample-wise data transformation associated with each of the at least one selected candidate features. In some embodiments, constructing the model further includes generating at least one set of expression data using a profiling platform, the at least one set of expression data being at least one of the first and second sets of expression data.
In some embodiments, the method includes transforming the sample expression data using the constructed primary model. In some embodiments, the method includes constructing a categorical model by regression analysis from at least one of: (a) at least some of the transformed sample expression data and (b) at least some of the common expressions. In some embodiments, at least one of the: (a) selection of at least some of the transformed sample expression data and (b) selection of at least some of the common expressions, is based on phenotypic data or any factor known to introduce cross-platform bias. In some embodiments, the method includes iterating this process using the categorical model constructed from the transformed sample expression data to transform the transformed sample expression data and constructing another categorical model therefrom. In some embodiments, the method includes transforming a set of expression data from the first profiling platform to the second profiling platform by applying the constructed categorical models in the order of their construction.
In some embodiments, the first profiling platform or the second profiling platform is selected from the group consisting of but not restricted to Agilent Gene Expression Microarrays, Affymetrix Gene Profiling Array cGMP U133 P2/Human Genome U133 Plus 2.0/U133A 2.0, Illumina Genome Analyzer/MiSeq/NextSeq/HiSeq, NanoString nCounter SPRINT/MAX/FLEX, and Oxford Nanopore MinION/PromethION/GridION. In some embodiments, the at least one common expression is identified by at least one of matching genomic positions, matching exons, matching isoforms, and matching transcripts. In some embodiments, the at least one candidate feature is selected from the group consisting of mean transcript expression, mean normalized probe intensity, number of detected genes, total number of reads per sample, average numbers of reads per exon/gene/isoform, read coverage, and any other appropriate statistics of each sample. In some embodiments, each of the models is selected from the group consisting of a linear model, a logarithmic model, a piecewise linear model, and any other appropriate regression model.
According to another aspect of the disclosure, embodiments of the present invention relate to an apparatus for transforming gene expression data. In some embodiments, the apparatus includes a processor. In some embodiments, the apparatus includes an interface. In some embodiments, the apparatus includes computer executable instructions operative on said processor. In some embodiments, the computer executable instructions operate on said processor to construct a primary model utilizing sample expression data for transforming gene expression data from a first profiling platform to a second profiling platform such that the overall distribution of the transformed data resembles that of the second platform.
In some embodiments, the computer executable instructions for constructing the primary model comprise computer executable instructions for identifying at least one common expression between a first set of nucleic acid expression data derived using a first profiling platform and a second set of nucleic acid expression data derived using a second profiling platform, each common expression associated with a sample present in both the first set and second set. In some embodiments, the computer executable instructions for constructing a model comprise computer executable instructions for performing regression analysis on the at least one common expression, resulting in one set of regression parameters for each sample. In some embodiments, the computer executable instructions for constructing a model comprise computer executable instructions for selecting at least one candidate feature from the first profiling platform that predicts the at least one set of regression parameters. In some embodiments, the computer executable instructions for constructing a model comprise computer executable instructions for identifying a primary model associated with each of the at least one selected candidate features.
In some embodiments, the interface is configured to receive at least one set of expression data from a profiling platform, the at least one set of expression data being at least one of the first and second sets of expression data.
In some embodiments, the apparatus further includes computer executable instructions operative on said processor for transforming the sample expression data using the constructed primary model. In some embodiments, the apparatus further includes computer executable instructions operative on said processor for constructing a categorical model by regression analysis from at least one of: (a) at least some of the transformed sample expression data and (b) at least some of the common expressions. In some embodiments, at least one of the: (a) selection of at least some of the transformed sample expression data and (b) selection of at least some of the common expressions, is based on phenotypic data or any other factor known to introduce cross-platform bias. In some embodiments, the apparatus further includes computer executable instructions operative on said processor for iterating this process using the categorical model constructed from the transformed sample expression data to transform the transformed sample expression data and construct another categorical model therefrom. In some embodiments, the apparatus further includes computer executable instructions operative on said processor for transforming a set of expression data from the first profiling platform to the second profiling platform by applying the constructed categorical models in the order of their construction.
In some embodiments, the first profiling platform or the second profiling platform is selected from the group consisting of but not restricted to Agilent Gene Expression Microarrays, Affymetrix Gene Profiling Array cGMP U133 P2/Human Genome U133 Plus 2.0/U133A 2.0, Illumina Genome Analyzer/MiSeq/NextSeq/HiSeq, NanoString nCounter SPRINT/MAX/FLEX, and Oxford Nanopore MinION/PromethION/GridION.
In some embodiments, the computer executable instructions for identifying at least one common expression include computer executable instructions for identifying at least one common expression by at least one of matching genomic positions, matching exons, matching isoforms, and matching transcripts. In some embodiments, the at least one candidate feature is selected from the group consisting of mean transcript expression, mean normalized probe intensity, number of detected genes, number of reads per sample, average number of reads per exon/gene/isoform, read coverage, and any other appropriate statistics of each sample. In some embodiments, each of the models is selected from the group consisting of a logarithmic model, a linear model, a piecewise linear model, and a regression model.
These and other features and advantages, which characterize the present non-limiting embodiments, will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive of the non-limiting embodiments as claimed.
The accompanying drawings are not intended to be drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures may be represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing. Various embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Cross-platform compatibility of gene expression data is a crucial and active topic of research. It can be inefficient to manage and analyze sample data originating from a mixture of platforms. For instance, The Cancer Genome Atlas (TCGA) currently has five different platforms for RNA expression: Agilent G4502A, Affymetrix HT-HG_U133A, HG-U133_Plus_2, Illumina GA and Illumina HiSeq 2000, thus making it difficult to leverage the full potential of the data through a combined analysis. Furthermore, the dynamic ranges of gene expressions can vary considerably depending on the choice of profiling platform.
With the huge amount of legacy data generated throughout the years based on former technologies, the diversity of existing platforms, and the emergence of new ones, it can be advantageous to provide compatibility of data across various platforms. Breaking platform barriers means saving the cost of re-profiling of samples in order to perform combined analysis. It can also solve the backward compatibility problem and facilitate the adoption of new profiling technologies by allowing legacy data to be readily transformed for use with data from newer platforms. Specifically, tremendous resources have been spent on microarray studies, and it is desirable to transfer the knowledge and insights from these studies onto new platforms such as next-generation sequencing (NGS) technologies.
Embodiments of the present invention facilitate cross-platform compatibility of gene expression data using models that transform expression data from one platform to another. These embodiments can also, in the clinical research setting, be applied across different cohorts available to a clinical researcher in order to evaluate many signatures on a new cohort, either by transforming the input data to the primary platform or by adapting the parameters of a signature to the alternative platforms.
With reference to
However, in some embodiments, the model construction process may include additional levels of iteration. In these embodiments, an additional model, e.g., a categorical regression model, may be constructed from transformed expression data (Step 108). This additional model may be used, in turn, to transform additional expression data (Step 104). This process may be iterated through additional rounds of categorical model construction (Step 108) and the application of those categorical models to transform expression data (Step 104), which may then be used to construct additional categorical models (Step 108) and so on. When multiple categorical models are constructed and subsequently used to transform expression data, the models are applied in the order of their construction—i.e., the first model constructed is the first model used to transform data, the second model constructed is used to transform the data transformed by the first model, and so on.
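The ordering rule described above can be sketched as a simple function composition. The following is a minimal illustration only; the names `apply_in_construction_order`, `primary_model`, and `categorical_model` are hypothetical, and the two example models are arbitrary stand-ins rather than models taken from this disclosure.

```python
from typing import Callable, List, Sequence

# A model is represented here as any function mapping one expression
# profile (a list of values) to a transformed profile. This representation
# is an assumption made for illustration.
Transform = Callable[[List[float]], List[float]]


def apply_in_construction_order(models: Sequence[Transform],
                                profile: List[float]) -> List[float]:
    """Apply each model to the output of the previous one, earliest first."""
    result = list(profile)
    for model in models:  # models[0] is the first model that was constructed
        result = model(result)
    return result


# Hypothetical stand-ins for a primary model and one categorical model.
def primary_model(values: List[float]) -> List[float]:
    return [2.0 * v + 1.0 for v in values]


def categorical_model(values: List[float]) -> List[float]:
    return [v - 0.5 for v in values]


transformed = apply_in_construction_order(
    [primary_model, categorical_model], [0.0, 1.0, 2.0])
```

Because each model consumes the output of its predecessor, reversing the list generally yields a different result, which is why the order of construction is preserved.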
As discussed above, embodiments of the present invention typically construct a primary model for transforming data sets between platforms (Step 100). With reference to
If there is no direct mapping between the two sets of targets, the targets could be mapped from one set to another by their genomic positions. For example, if the source data are RNA-Seq exon expressions and the destination data are microarray gene expressions, then exons that overlap with the microarray probe-sets can be identified and summarized into gene expressions before applying regression.
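One way to picture the position-based mapping is the sketch below. It assumes inclusive genomic coordinates and summarizes overlapping exons by their mean; the text does not specify the summary statistic, so the mean is an assumption, as are the names `Exon`, `ProbeSet`, and `summarize_exons_by_gene`.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Sequence


class Exon(NamedTuple):
    chrom: str
    start: int         # inclusive genomic coordinate
    end: int           # inclusive genomic coordinate
    expression: float  # exon-level expression on the source platform


class ProbeSet(NamedTuple):
    chrom: str
    start: int
    end: int
    gene: str          # gene reported by the destination platform


def overlaps(exon: Exon, probe_set: ProbeSet) -> bool:
    """True when the two intervals share at least one position."""
    return (exon.chrom == probe_set.chrom
            and exon.start <= probe_set.end
            and probe_set.start <= exon.end)


def summarize_exons_by_gene(exons: Sequence[Exon],
                            probe_sets: Sequence[ProbeSet]) -> Dict[str, float]:
    """Average the expressions of exons overlapping each gene's probe-sets."""
    per_gene: Dict[str, List[float]] = defaultdict(list)
    for exon in exons:
        # Collect genes in a set so that an exon hit by several probe-sets
        # of the same gene is counted only once for that gene.
        genes = {p.gene for p in probe_sets if overlaps(exon, p)}
        for gene in genes:
            per_gene[gene].append(exon.expression)
    return {gene: sum(vals) / len(vals) for gene, vals in per_gene.items()}


exons = [Exon("chr1", 100, 200, 4.0), Exon("chr1", 300, 400, 6.0),
         Exon("chr1", 1000, 1100, 9.0), Exon("chr2", 100, 200, 3.0)]
probe_sets = [ProbeSet("chr1", 150, 350, "GENE_A"),
              ProbeSet("chr2", 500, 600, "GENE_B")]
gene_expression = summarize_exons_by_gene(exons, probe_sets)
```

Genes with no overlapping exon, such as the hypothetical GENE_B here, are simply absent from the result and would be excluded from the subsequent regression.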
With {Si}i=1 . . . N referring to the N training samples available to construct the model for transforming gene expression data from Platform X to Platform Y, for each sample Si, regression is performed between xi and yi using expressions that are detected on both platforms (Step 204).
The target model for the regression process is assumed a priori to be defined by M parameters. Depending on the observed relationship between the training data from the source and destination platforms, any regression model can be chosen that results in the least error, such as non-linear, logarithmic, LOESS (local regression) or errors-in-variables (EiV) models. In addition, an optimization function can be applied to choose a model with the least error. This choice may be the decision of a human operator, or it may be the outcome of an automated or a semi-automated process. With an appropriate model selected, the output of the regression process is N sets of parameters ri={rk}i, k=1 . . . M.
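The per-sample regression can be sketched as follows for the simplest case of a linear model (M=2). This is an illustration under stated assumptions: undetected expressions are encoded as `None`, ordinary least squares is used, and the function names are hypothetical. Any of the other model families mentioned above could be substituted.

```python
from typing import List, Optional, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx  # assumes the x values are not all identical
    return slope, mean_y - slope * mean_x


def per_sample_regression(
        x_profiles: Sequence[Sequence[Optional[float]]],
        y_profiles: Sequence[Sequence[Optional[float]]],
) -> List[Tuple[float, float]]:
    """Return one (slope, intercept) pair per training sample."""
    params = []
    for xi, yi in zip(x_profiles, y_profiles):
        # Keep only targets detected on both platforms for this sample.
        pairs = [(a, b) for a, b in zip(xi, yi)
                 if a is not None and b is not None]
        xs = [a for a, _ in pairs]
        ys = [b for _, b in pairs]
        params.append(fit_line(xs, ys))
    return params


# Two toy samples with four targets each; None marks "not detected".
x_profiles = [[1.0, 2.0, 3.0, None], [0.0, 2.0, None, 4.0]]
y_profiles = [[3.0, 5.0, 7.0, 1.0], [1.0, 2.0, 9.0, 3.0]]
regression_params = per_sample_regression(x_profiles, y_profiles)
```

The result is a list of N parameter sets, one per sample, which corresponds to the output ri described above.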
Given the regression parameters ri for each sample Si, candidate features f are selected from the data generated by Platform X that can be good predictors for the regression parameters (Step 208). For example, if Platform X is a microarray platform, the candidates f may include mean expression, mean normalized probe intensity, etc. If Platform X is an RNA-Seq platform, the candidates f may include mean expression, number of detected genes, total number of reads, read coverage, etc. As with the choice of regression model, the identification of candidate features f may be performed by a human operator or by an automated or semi-automated process.
It is not necessary for the predictive features to be extracted only from the source data. Sometimes features from the target data may have good performance in predicting the regression parameters. In some embodiments, such target platform features may also be included in the model and, e.g., assigned with the mean value of the feature in the training data for the transformation process.
Having identified possible candidate features f from Platform X (Step 208), the features fk that actually predict the regression parameters ri are then identified from the set of possible candidate features f (Step 212). In one embodiment, the predictive features may be identified by means of, e.g., stepwise regression or other automated, manual, or semi-automated methods. If the goal is to select a single predictive feature for each parameter instead of a subset, the feature with the highest correlation with the parameter can be selected.
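The single-feature case can be sketched as below. The sketch ranks candidates by the absolute value of the Pearson correlation, which is an assumption: the text says only "highest correlation", and a strongly negative correlation is treated here as equally predictive. The feature names and values are invented for illustration.

```python
import math
from typing import Dict, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)  # assumes neither input is constant


def select_best_feature(candidates: Dict[str, Sequence[float]],
                        parameter: Sequence[float]) -> str:
    """Name of the candidate most strongly correlated with the parameter."""
    return max(candidates,
               key=lambda name: abs(pearson(candidates[name], parameter)))


# One value per training sample for each hypothetical candidate feature.
candidates = {
    "mean_expression": [5.0, 6.0, 7.0, 8.0],
    "total_reads": [10.0, 30.0, 20.0, 25.0],
}
slopes = [1.2, 1.0, 0.8, 0.6]  # one regression parameter across samples
best_feature = select_best_feature(candidates, slopes)
```

Selecting a subset of features rather than a single one would call for a procedure such as the stepwise regression mentioned above, which this sketch does not attempt.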
The output of the model construction process consists of the identified predictive features fk and their corresponding models yk for the prediction of the regression model parameters ri for each sample Si (Step 216).
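As an illustration of this output, the sketch below fits one linear predictor per regression parameter from a single selected feature. It assumes that the same feature is used for every parameter and that each predictor is linear; both are simplifications chosen for brevity, and the names are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx  # assumes the feature varies across samples
    return slope, mean_y - slope * mean_x


def fit_parameter_predictors(
        feature_values: Sequence[float],
        sample_params: Sequence[Sequence[float]],
) -> List[Tuple[float, float]]:
    """Fit parameter_k = a_k * feature + b_k for each of the M parameters."""
    n_params = len(sample_params[0])
    predictors = []
    for k in range(n_params):
        kth_param = [params[k] for params in sample_params]
        predictors.append(fit_line(feature_values, kth_param))
    return predictors


# Toy training data: one feature value and one (slope, intercept) pair
# per sample, as produced by a per-sample linear regression.
mean_expression = [5.0, 6.0, 7.0, 8.0]
sample_params = [(1.2, 0.5), (1.0, 1.0), (0.8, 1.5), (0.6, 2.0)]
predictors = fit_parameter_predictors(mean_expression, sample_params)
```

Each entry of `predictors` plays the role of one model for predicting a regression parameter from a feature value of a new sample.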
In some embodiments, the expression data for a particular platform (e.g., xi for Platform X, yi for Platform Y, etc.) with appropriate normalization is generated for the training samples {Si} (not shown) prior to the identification of the common expressions (Step 200).
Once the primary model has been created, it may be used to transform subsequent samples from Platform X to Platform Y. Assume for the following discussion that there exists data generated on Platform X for a new sample Pn. That data includes the expression profile zn and sets of predictive feature values {vk}n that correspond to the predictive features {fk}k=1, . . . , M discussed above in connection with
With reference to
The predicted regression model parameters rn can be applied to a pre-defined regression model (Step 304), enabling the estimation of the expression profile as ẑn(0)={ẑg
In some embodiments, the primary model may suffice to transform expression data between profiling platforms. As discussed above, in other embodiments additional levels of categorical modeling and transformation may be employed to convert data between platforms.
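The primary transformation of a new sample can be sketched in two steps: predict the regression parameters from the sample's feature values, then apply the pre-defined regression model to the expression profile. The sketch assumes a linear regression model with two parameters and linear predictors; the coefficient values and function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def predict_parameters(predictors: Sequence[Tuple[float, float]],
                       feature_values: Sequence[float]) -> List[float]:
    """Evaluate parameter_k = a_k * v_k + b_k for each parameter k."""
    return [a * v + b for (a, b), v in zip(predictors, feature_values)]


def transform_profile(profile: Sequence[float],
                      params: Sequence[float]) -> List[float]:
    """Apply the assumed linear model z_hat = slope * z + intercept."""
    slope, intercept = params
    return [slope * z + intercept for z in profile]


# Hypothetical predictors for (slope, intercept), e.g. obtained in training.
predictors = [(-0.2, 2.2), (0.5, -2.0)]

# New sample measured on the source platform: its expression profile and
# the feature value used for each parameter (here the same feature twice).
new_profile = [2.0, 4.0]
new_feature_values = [6.0, 6.0]

predicted_params = predict_parameters(predictors, new_feature_values)
estimated_profile = transform_profile(new_profile, predicted_params)
```

No data from the destination platform is needed for the new sample; only the source-platform profile and its feature values enter the computation.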
In particular, if there are one or more factors that introduce additional cross-platform discrepancies, then additional levels of regression related to the factors can be performed on the data, with the transformed data from the previous level of regression serving as the input to the next level of regression.
For example, assume the existence of one factor of well-defined categories cl={cm}l, m=1, . . . , O
With reference to
With reference to
This process of categorical modeling and transformation of the training data shown in
According to one embodiment, there is provided a system and method for the transformation of gene expression data (in log2 scale) from Affymetrix GeneChip HT Human Genome U133 Array Plate Set (RMA) to Illumina HiSeq 1000 RNA-Seq (RSEM) using 545 TCGA samples that have data generated on each of the respective platforms. Some sample-wise statistics are summarized for the two platforms in Table 1. The mean correlation per sample is 0.713 and higher expressions show stronger correlation in general.
By generating scatterplots of RNA-Seq vs. microarray expressions for each sample, it can be seen that their relationship can be suitably approximated by a piecewise linear model. In an exemplary implementation using the R programming language, the ‘lm’ function for linear regression and the ‘segmented’ function of the ‘segmented’ package are applied for breakpoint (xb) estimation. This results in four regression parameters {m1, c1, m2, c2} for the linear models before and after the estimated breakpoint: y1=m1x1+c1 for x≦xb and y2=m2x2+c2 for x>xb. Summary statistics of the regressed piecewise linear models are summarized in Table 2 below.
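For readers without access to R, the shape of such a model can be illustrated in Python. The sketch below is not the algorithm of the ‘segmented’ package: it substitutes an exhaustive search over candidate breakpoints and fits two independent least-squares lines, without enforcing continuity at the breakpoint. All names are hypothetical.

```python
from typing import NamedTuple, Sequence, Tuple


class PiecewiseModel(NamedTuple):
    m1: float
    c1: float
    m2: float
    c2: float
    xb: float  # breakpoint: the first line applies for x <= xb


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sse(x, y, m, c) -> float:
    """Sum of squared residuals of y = m * x + c."""
    return sum((b - (m * a + c)) ** 2 for a, b in zip(x, y))


def fit_piecewise(x, y, min_points: int = 3) -> PiecewiseModel:
    """Try every split of the sorted data and keep the least-error one."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    best = None
    for split in range(min_points, len(xs) - min_points + 1):
        lx, ly, rx, ry = xs[:split], ys[:split], xs[split:], ys[split:]
        # Skip splits between tied x values or with no spread in a segment.
        if lx[-1] == rx[0] or lx[0] == lx[-1] or rx[0] == rx[-1]:
            continue
        m1, c1 = fit_line(lx, ly)
        m2, c2 = fit_line(rx, ry)
        error = sse(lx, ly, m1, c1) + sse(rx, ry, m2, c2)
        if best is None or error < best[0]:
            best = (error, PiecewiseModel(m1, c1, m2, c2, xb=lx[-1]))
    if best is None:
        raise ValueError("not enough distinct points for two segments")
    return best[1]


def predict(model: PiecewiseModel, x: float) -> float:
    if x <= model.xb:
        return model.m1 * x + model.c1
    return model.m2 * x + model.c2


# Toy data: y = 0.5x + 1 up to x = 3, then y = 2x - 4.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.0, 1.5, 2.0, 2.5, 4.0, 6.0, 8.0, 10.0]
model = fit_piecewise(x, y)
```

In this sketch the breakpoint is reported as the largest x value assigned to the first segment, so it always coincides with an observed data point.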
Next, a small set of candidate features for predicting the four regression model parameters is generated, and it can be determined that mean expression level is a plausible single linear predictor, with moderately strong correlations of R=−0.55 with m1 and R=0.74 with c2, despite weaker correlations of R=−0.27 with c1 and R=0.19 with m2, the latter of which has a small variance of 0.04. The linear models between mean expression level and the four regression parameters are shown in
Using mean expression level as a predictor, the piecewise-linear model for each sample can be predicted.
As illustrated, for moderate-to-high microarray expressions, the predicted RNA-Seq expressions have a root-mean-square error erms=1.4, which is very close to the error of 1.39 obtained using the parameter values estimated by direct regression. To further improve the accuracy, an additional level of regression and transformation can be applied to the primarily transformed values, stratified by gene across all samples, using the categorical approach described above.
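The gene-stratified second level can be sketched as follows, together with the root-mean-square error used to assess it. The sketch assumes one linear correction per gene, fitted across all training samples between the primarily transformed values and the values observed on the destination platform. The data are toy values and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # rows are samples, columns are genes


def rmse(predicted: Matrix, observed: Matrix) -> float:
    """Root-mean-square error over all samples and genes."""
    squared = [(p - o) ** 2
               for p_row, o_row in zip(predicted, observed)
               for p, o in zip(p_row, o_row)]
    return math.sqrt(sum(squared) / len(squared))


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx  # assumes each gene varies across the samples
    return slope, mean_y - slope * mean_x


def fit_gene_models(primary: Matrix,
                    target: Matrix) -> List[Tuple[float, float]]:
    """One (slope, intercept) per gene, fitted across all samples."""
    n_genes = len(primary[0])
    return [fit_line([row[g] for row in primary],
                     [row[g] for row in target])
            for g in range(n_genes)]


def apply_gene_models(primary: Matrix,
                      models: Sequence[Tuple[float, float]]) -> List[List[float]]:
    return [[m * value + c for value, (m, c) in zip(row, models)]
            for row in primary]


# Toy data: three samples, two genes.
primary = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]  # after the primary model
target = [[2.0, 3.5], [4.0, 4.5], [6.0, 5.5]]   # destination platform

gene_models = fit_gene_models(primary, target)
corrected = apply_gene_models(primary, gene_models)
rmse_before = rmse(primary, target)
rmse_after = rmse(corrected, target)
```

On real data the error would be evaluated on held-out samples rather than on the training samples used here, since a per-gene fit will always reduce the training error.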
Referring to
Processor 1008 is configured as discussed above to build primary and categorical models for transforming gene expression data from a first profiling platform to a second profiling platform such that the overall distribution of the transformed data resembles that of the second platform.
Embodiments of the present invention may be extended to compute unified expressions from data measured by multiple platforms. For instance, all data may be transformed to one specific platform using, e.g., an EiV regression model, and then, for each target, the transformed values may be combined by weighted averaging, using weights that are inversely proportional to the estimated noise variances of the respective source platforms.
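The weighted averaging step can be sketched as below for a single target. The sketch assumes that the values have already been transformed to a common platform and that a noise variance estimate is available for each source platform; the numbers and the function name are hypothetical.

```python
from typing import Sequence


def combine_inverse_variance(values: Sequence[float],
                             noise_variances: Sequence[float]) -> float:
    """Weighted mean with weights proportional to 1 / noise variance."""
    weights = [1.0 / v for v in noise_variances]  # variances must be > 0
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, values)) / total


# One target measured on two source platforms, both already transformed
# to the same destination platform.
transformed_values = [10.0, 12.0]
estimated_noise_variances = [1.0, 3.0]
unified_expression = combine_inverse_variance(
    transformed_values, estimated_noise_variances)
```

The platform with the smaller estimated noise variance contributes more to the unified value, so the result lies closer to 10.0 than to 12.0 in this example.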
While the above embodiments of the present invention are described with respect to measurements performed on genomics platforms, the same process and procedures can be applied to physiology modelling, imaging, personal continuous health data and others.
Although the above embodiments of the present invention are described with respect to gene expression data, the process and procedures described herein are applicable to solving the compatibility problem across different platforms or analytical pipelines for any numerical readings, for example methylation levels, protein expressions, or even sensor measurements, that exhibit structural discrepancies due to the inherent differences of the underlying systems.
While several embodiments of the present invention have been described and illustrated herein, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the functions and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the present invention. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings of the present invention is/are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, the invention may be practiced otherwise than as specifically described and claimed. The present invention is directed to each individual feature, system, article, material, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, and/or methods, if such features, systems, articles, materials, and/or methods are not mutually inconsistent, is included within the scope of the present invention.
The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”
The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, e.g., elements that are conjunctively present in some cases and disjunctively present in other cases. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified unless clearly indicated to the contrary. Thus, as a non-limiting example, a reference to “A and/or B,” when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A without B (optionally including elements other than B); in another embodiment, to B without A (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
As used herein in the specification and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, e.g., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the claims, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (e.g. “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.
As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” and the like are to be understood to be open-ended, e.g., to mean including but not limited to.
Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedures, Section 2111.03.
Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
It should also be understood that, unless clearly indicated to the contrary, in any methods claimed herein that include more than one step or act, the order of the steps or acts of the method is not necessarily limited to the order in which the steps or acts of the method are recited.
The present application is related to co-pending U.S. provisional application No. 62/065,367, filed on Oct. 17, 2014, the entire disclosure of which is hereby incorporated by reference as if set forth in its entirety herein.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2015/057952 | 10/16/2015 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62065367 | Oct 2014 | US |