The present invention relates to a method for assessing survival risk of a patient from a plurality of microarray gene expression data as samples, where the samples include both completed samples and censored samples.
There follows a list of references that are occasionally cited in the specification. Each of the disclosures of these references is incorporated by reference herein in its entirety.
An important objective of clinical cancer research is to develop tools to accurately predict the survival time and risk profile of patients based on the DNA microarray data and various clinical parameters. There are several existing techniques in the literature for performing this type of survival analysis. Among them, both Cox proportional hazards model (Cox) [1] and the accelerated failure time model (AFT) [2] have been widely used. Cox model is the most popular approach by far in survival analysis to assess the significance of various genes in the survival risk of patients through the hazard function. On the other hand, the requirement for analyzing failure time data arises in investigating the relationship between a censored survival outcome and high-dimensional microarray gene expression profiles. Therefore, the AFT model has been studied extensively in recent years. However, various current cancer survival analysis mechanisms have not demonstrated themselves to be very accurate as expected. The accuracy problems, in essence, are related to some fundamental dilemmas in cancer survival analysis. We believe that any attempt to improve the accuracy of survival analysis method has to compromise between these two dilemmas.
The first dilemma is related to the small sample size and the censoring of survival data versus high dimensional covariates in the Cox model.
High-dimensional survival analysis in particular has attracted much interest due to the popularity of microarray studies involving survival data. This is statistically challenging because the number of genes, p, is typically hundreds of times larger than the number of microarray samples, n (p>>n). For survival analysis, the sample size is further reduced significantly by the availability of follow-up data for the analyzed samples. In fact, in publicly available gene expression databases, only a small fraction of human-tumor microarray datasets provides clinical follow-up data. A “low-risk” or “high-risk” classification based on the Cox model usually relies on traditional supervised learning techniques, in which only completed data (i.e. data from samples with clinical follow-up) can be used for learning, while censored data (i.e. data from samples without clinical follow-up) are disregarded. Thus, the small sample size and the censoring of survival data remain a bottleneck in obtaining robust and accurate classifiers with the Cox model. Recently, a technique called semi-supervised learning [3] in machine learning suggests that censored data, when used in conjunction with a limited amount of completed data, can produce considerable improvement in learning accuracy. Indeed, semi-supervised learning has been proved to be effective in solving different biological problems. For example, “corrected” Cox scores were used for semi-supervised prediction using the principal component regression by Bair and Tibshirani [4] and the semi-supervised classification using nearest-neighbor shrunken centroid clustering by Tibshirani et al. [5].
The second dilemma is related to the similar phenotype disease versus different genotype cancer in the AFT model.
In the accelerated failure time model, to increase the available sample size and get the more accurate result, each censored observation time is replaced with the imputed value using some estimators, such as the inverse probability weighting (IPW) method, mean imputation method, Buckley-James method and rank-based method. In fact, these estimation methods assume that the AFT model was used for the patients with similar phenotype cancer, and the survival times should satisfy the same unspecified common probability distribution. Nevertheless, the disparity we see in disease progression and treatment response can be attributed to that the similar phenotype cancer may be completely different diseases on the molecular genotype level. Therefore, we need to identify different cancer genotypes. Can we do it based exclusively on the clinical data? For example, patients can be assigned to a “low-risk” or a “high-risk” subgroup based on whether they were still alive or whether their tumour had metastasized after a certain amount of time. This approach has also been used to develop procedures to diagnose patients [6]. However, by dividing the patients into subgroups just based on their survival times, the resulting subgroups may not be biologically meaningful. Suppose, for example, the underlying cell types of each patient are unknown. If we were to assign patients to “low-risk” and “high-risk” subgroups based on their survival times, many patients would be assigned to the wrong subgroup, and any future predictions based on this model would be suspect.
There is a need in the art to have a more accurate classification method by identifying these underlying cancer subtypes based on microarray data and clinical data together so as to build a model that can determine which subtype is present in patients.
An aspect of the present invention is to provide a computer-implemented method for selecting a significant relevant gene set correlated to a clinical variable from a plurality of microarray gene expression data as samples. The samples are separated into completed samples and censored samples. The completed samples collectively give a plurality of completed data.
The method comprises repeating an iterative process for a number of instances. When the first instance of the iterative process is executed, the plurality of completed data forms a first current set of informative data used in the execution.
The iterative process comprises the following steps:
Other aspects of the present invention are disclosed as illustrated by the embodiments hereinafter.
The approach adopted in the present invention is to strike a tactical balance between the two contradictory dilemmas mentioned above. We propose a novel semi-supervised learning method based on the combination of Cox and AFT models with L1/2 regularization for high-dimensional and low sample size biological data. In this semi-supervised learning framework, the Cox model can classify the “low-risk” or a “high-risk” subgroup though samples as many as possible to improve its predictive accuracy. Meanwhile, the AFT model can estimate the censored data in the subgroup, in which the samples have the same molecular genotype. Combined with L1/2 regularization, some genes can be selected by Cox and AFT models and they are significantly relevant with the cancer.
Before elaborating the disclosed method, we provide some backgrounds on related techniques, on the basis of all of which the disclosed method is developed.
The Cox proportional hazards model is now the most widely used for survival analysis to classify the patients into a “low-risk” or a “high-risk” subgroup after prognostic. Under the Cox model, the hazard function for the covariate matrix x={x1, x2, . . . , xi, . . . xn} with a sample size n and the number of genes p is specified as λ(t)=λ0(t)exp(β′x), where t is the survival time, β′ is the coefficient vector of x, and the baseline hazard function λ0(t) is common to all subjects, but is unspecified or unknown. Let an ordered risk set at time t(r) be denoted by Rr={j∈1, . . . , n:tj≧t(r)}. Assume that censoring is non informative and that there are no tied event times. The Cox log partial likelihood can then be defined as
where D denotes the set of indices for observed events.
The AFT model is a linear regression model for survival analysis, in which the logarithm of response ti is related linearly to covariates xi:
h(ti)=β0+x′iβ+εi, i=1, . . . , n, (2)
where h(·) is the log transformation or some other monotone function. In this case, the Cox assumption of multiplicative effect on a hazard function is replaced with the assumption of multiplicative effect on an outcome. In other words, it is assumed that the variables xi's act multiplicatively on time and therefore affect the rate at which individual i proceeds along the time axis. Because censoring is present, the standard least squares approach cannot be employed to estimate the regression parameters in (2) even when p<n. One approach for AFT model implementation entails the replacement of censored ti with imputed values. One such approach is that of mean imputation in which each censored ti is replaced with the conditional expectation of tj given tj>ti[7]. The imputed value h(ti) can then be given by
where Ŝ is the Kaplan-Meier estimator (Kaplan and Meier (1958), Nonparametric estimation from incomplete observations, Journal of the American Statistical Association, Vol. 53, pp. 457-81) of the survival function and where ΔŜ(t(r)) is the step of Ŝ at time t(r). Ref. [8] also assessed the performance of several approaches to AFT model implementation, including reweighting the observed ti, replacement of each censored ti with an imputed observation, drawn from the conditional distribution of t (multiple imputation), and mean imputation. They found that the mean imputation approach outperformed reweighting and multiple imputation under the lasso penalization in the high-dimensional and low-sample size setting.
In recent years, various regularization methods for survival analysis under the Cox and AFT models have been proposed, which perform both continuous shrinkage and automatic gene selection simultaneously. For example, Cox-based methods utilizing kernel transformations [9], threshold gradient descent minimization [10] and lasso penalization [11] have been proposed. Likewise, some researchers have proposed variable selection methods based on accelerated failure time models. Most of these procedures are based on the L1-norm, however, the results of L1 regularization are not good enough for sparsity, especially in biology research. Theoretically, the Lq (0<q<1) type regularization with a lower value of q would lead to better solutions with more sparsity. Moreover, among Lq regularizations with q∈(0,1), only L1/2 and L2/3 regularizations permit an analytically expressive thresholding representation. The inventors' previous works have also demonstrated the efficiencies of L1/2 regularization for the Cox and AFT models, respectively [12]. The sparse L1/2 regularization model is expressed as:
where l is a loss function and λ is a tuning parameter. Since the penalty function of L1/2 regularization is non-convex, which raises numerical challenges in fitting the Cox and AFT models. Recently, coordinate descent algorithms [13] for solving non-convex regularization approach (such as SCAD, MCP) have been shown to have significant efficiency and convergence [14]. The algorithms optimize a target function with respect to a single parameter at a time, iteratively cycling through all parameters until reaching convergence. Since the computational burden increases only linearly with the number of the covariates p, coordinate descent algorithms can be a powerful tool for solving high-dimensional problems.
Therefore, in this work, we introduce a novel univariate half thresholding operator of the coordinate descent algorithm for the L1/2 regularization, which can be expressed as:
where {tilde over (y)}i(j)Σk=jxikβk is the partial residual for fitting βj, ωj=Σi=lnxij(yi−{tilde over (y)}i(j)), and
Remark: In previous work [15], we used ¾λ2/3 for representing L1/2 regularization thresholding operator. Here, we introduce a new half thresholding representation
This new value is more precisely and effectively than the old one. Since it is known that the quantity of the solutions of a regularization problem depends seriously on the setting of the regularization parameter λ. Based on this novel thresholding operator, when λ is chosen by some efficient parameters tuning strategy, such as cross-validation, the convergence of the algorithm is proved [16].
In the semi-supervised learning framework as disclosed herein, the predictive accuracy of the Cox and AFT models would be improved because the number of the training data increased and the censored data were imputed reasonably. The L1/2 regularization approach can select the significant relevant gene sets based on the Cox and AFT models respectively.
In the disclosed semi-supervised learning method, the censored data are evaluated from the same risk class to improve prediction performance. However, there are some observable errors in the imputations of the censored data. For example, the estimated survival time by the AFT model was even less than the censored time. We regarded them as error estimations, and did not use them for model training.
To evaluate the performance of our proposed semi-supervised learning method based on Cox and AFT models with L1/2 regularization, we adopted the simulation scheme in R. Bender's work [17]. The generation procedure of the simulated data is as follows.
Step 1: We generate γi0, γi1, . . . , γip(i=1, . . . , n) independently from a standard normal distribution and set: Xij=γij√{square root over (1−ρ)}+γi0√{square root over (ρ)}(j=1, . . . , p) where ρ is the correlation coefficient.
Step 2: The survival time yi is written as:
in which U is a uniformly distributed variable, ω is the scale parameter, and α is the shape parameter.
Step 3: The censoring time point y′i (i=1, . . . , n) is obtained from a random distribution E(θ), where θ is determined by a specified censoring rate.
Step 4: Here we define yi=min(yi, y′i) and δi=1,if yi<y′i, else δi=0, the observed data represented as (yi, xi, δi) for the model are generated.
In our simulated experiments, we build high-dimensional and low sample size datasets. In every dataset, the dimension of the predictive genes is p=1000, in which 10 prognostic genes and their corresponding coefficients are nonzero. The coefficients of the remaining 990 genes are zero. About 40% of the data in each subgroup are right censored. We considered the training sample sizes are n=100, 200, 300 and the correlation coefficients of genes are ρ=0 and ρ=0.3 respectively. The simulated data were applied to the single Cox, single AFT and semi-supervised learning approach with Cox and AFT models. For gene selection, we use L1/2 regularization approach and the regularization parameters are tuned by 5-fold cross validation. To assess the variability of the experiment, each method is evaluated on a test set including 200 samples, and replicated over 50 random training and test partitions.
The classification accuracy under the correlation coefficient ρ=0.3 with different training sample size setting was demonstrated in
The precision of our semi-supervised learning model with L1/2 regularization is given in Table 1.
The precision is got from the number of correct selected genes divided the total number of selected genes by the methods. With the sample size increase or the correction coefficients of the features decrease, the classification performances of each model become better. We found the single Cox and single AFT model is difficult to select the whole correct genes in the dataset. This means these models selected too few corrected genes and many other irrelevant genes in their results. This made their prediction precision very low. Nevertheless, our semi-supervised learning model solve this problem, the precision of the semi-Cox or the semi-AFT group were both higher than that obtained by the single Cox or single AFT model. After processed by our semi supervised learning method, the number of selected correct genes was increased, and the number of total selected genes was decreased, the semi-Cox achieved about 130% improvements in precision compared to the single Cox model. Although the precision improvement of semi-AFT model is smaller than that of the semi-Cox model, it can select most correct genes under different parameter settings. Therefore, we think our semi-supervised learning method can significantly improve the accuracy of prediction for survival analyses with the high-dimensional and low sample size gene expression data.
In this section, the disclosed semi-supervised learning approach was applied to the four real gene expression datasets respectively, such as DLBCL(2002) [18], DLBCL(2003) [19], Lung cancer [20], AML [21]. The brief information of these datasets is summarized in Table 2.
In order to accurately assess the performance of the semi-supervised learning approach, the real datasets were randomly divided into two pieces: two thirds of the available patient samples, which include the completed and correct imputed censored data, were put in the training set used for estimation and the remaining completed and censored patients' data would be used to test the prediction capability. We used single Cox and single AFT with L1/2 regularization approaches for comparisons. For each procedure, the regularization parameters are tuned by 5-fold cross validation. All results are averaged over 50 repeated times respectively.
The integrated brier score (IBS) and the concordance index (CI) measurements were used to evaluate the classification and prediction performance of Cox and AFT models in the semi-supervised learning approach.
The Brier Score (BS) [22] is defined as a function of time t>0 by:
where Ĝ(·) denotes the Kaplan-Meier estimation of the censoring distribution and Ŝ(·|Xi) stands to estimate survival for the patient i. Note that the BS(t) is dependent on the time t, and its values are between 0 and 1. The good predictions at the time t result in small values of BS. The IBS is given by:
The IBS is used to assess the goodness of the predicted survival functions of all observations at every time between 0 and max(ti).
The CI can be interpreted as the fraction of all pairs of subjects which predicted survival times are correctly ordered among all subjects that can actually be ordered. By the CI definition, we can determine ti>ti when ƒi>ƒl and δl=1 where ƒ(·) is a survival function. The pairs for which neither ti>tj nor ti<tj can be determined are excluded from the calculation of the CI. Thus, the CI is defined as
Note that the values of CI are between 0 and 1, and that the perfect predictions of the building model would lead to 1 while have a CI of 0.5 at random.
As shown in
As shown in
In
On the other hand, we find that for these all four gene expression datasets, the selected genes from the Cox and AFT models are quite different and just small parts of them are overlapping. We think that the reason may be that the Cox model selects the relevant genes for low-risk and high-risk classification. Nevertheless, the genes selected by the AFT model are highly correlative with the survival time of patients. Therefore, these two models may select different genes, which have different biological functions. Through our below analyses, we know that the genes selected by semi-supervised learning methods are significantly relevant with cancer.
In this section, we introduce a brief biological analysis of the selected genes for the Lung cancer dataset to demonstrate the superiority of our proposed semi-supervised learning method. The number of selected genes by semi-supervised learning method is less than the single Cox and AFT model, but includes some genes which are significantly associated with cancer and cannot be selected by the two single Cox and AFT models, such as GDF15, ARHGDIB and PDGFRL. GDF15 belongs to the transforming growth factor-beta superfamily, and is one kind of bone morphogenetic proteins. It was showed that GDF15 can be seen as prognostication of cancer morbidity and mortality in men [23]. ARHGDM is the member of the Rho (or ARH) protein family; it is involved in many different cell events such as cell secretion, proliferation. It is likely to impact on the cancer [24]. The role of PDGFRL is to encode a protein contains an important sequence which is similar to the ligand binding domain of platelet-derived growth factor receptor beta. Biological research has confirmed that this gene can affect the sporadic hepatocellular carcinomas. This suggests that this gene product may get the function of the tumor inhibition.
At the same time, the Cox and AFT models with and without semi-supervised learning method also selected some common genes, e.g., the PTP4A2, TFAP2C and GSTT2. PTP4A2 is the member of the protein tyrosine phosphatase family. Overexpression of PTP4A2 will confer a transformed phenotype in mammalian cells, suggesting its role in tumorigenesis [25]. TFAP2C can encode a protein contains a sequence-specific DNA-binding transcription factor which can activate some developmental genes [26]. GSTT2 is one kind of a member of a superfamily of proteins. It has been proved to play an important role in human carcinogenesis and shows that these genes are linked to cancer with a certain relationship [27].
Through the comparison of the biological analyses of the selected genes, we found the semi-supervised method based on Cox and AFT models with L1/2 regularization is a competitive method compared to single regularized Cox and AFT models.
The present invention is developed based on our proposed semi-supervised learning framework as disclosed above. An aspect of the present invention is to provide a computer-implemented method for selecting a significant relevant gene set correlated to a clinical variable from a plurality of microarray gene expression data as samples. The samples are separated into completed samples and censored samples. The completed samples collectively give a plurality of completed data.
The method comprises repeating an iterative process for a number of instances. In a start-up stage, namely, when the first instance of the iterative process is executed, the plurality of completed data forms a first current set of informative data used in the execution.
Exemplarily, the iterative process comprises the following steps.
Each first imputed value and each second imputed value may be determined according to a mean imputation approach. Regularization parameters used in the L1/2 regularized Cox model and the L1/2 regularized AFT model may be tuned by a stratified K-fold cross-validation. Preferably, a univariate half thresholding operator of a coordinate descent algorithm for L1/2 regularization is used in the L1/2 regularized Cox model and the L1/2 regularized AFT model. The univariate half thresholding operator is given by (5).
Although the method is advantageously usable to risk survival assessment for the patient with cancer, the present invention is not limited only to cancer but can be applied to other diseases.
The embodiments disclosed herein may be implemented using general purpose or specialized computing devices, computer processors, or electronic circuitries including but not limited to digital signal processors (DSP), application specific integrated circuits (ASIC), field programmable gate arrays (FPGA), and other programmable logic devices configured or programmed according to the teachings of the present disclosure. Computer instructions or software codes running in the general purpose or specialized computing devices, computer processors, or programmable logic devices can readily be prepared by practitioners skilled in the software or electronic art based on the teachings of the present disclosure.
The present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiment is therefore to be considered in all respects as illustrative and not restrictive. The scope of the invention is indicated by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
This application claims the benefit of U.S. Provisional Patent Application No. 62/197,031, filed on Jul. 26, 2015, which is incorporated by reference herein in its entirety.
Number | Date | Country | |
---|---|---|---|
62197031 | Jul 2015 | US |