The present disclosure relates generally to precision treatment of cancer tissue and therapeutic biomarker screening, and, more particularly, to techniques for screening biomarkers based on an overlapping cluster grouping technique.
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventor, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
The use of drugs to selectively target specific genetic alterations in defined patient subpopulations has seen significant successes. One example can be found in the treatment of chronic myeloid leukemia (CML), where the first consistent chromosomal abnormality associated with a human cancer was identified in the 1960s. In the 1980s, the consequence of this abnormality was discovered to be the production of an abnormal gene called BCR-ABL. Intense drug discovery programs were initiated to shut down the activity of BCR-ABL, and in 1992, imatinib (Gleevec) was developed. In 1998, the drug was tested in CML patients who had exhausted standard treatment options and whose life expectancy was limited, with remarkable results: their blood counts returned to normal. In 2001, the FDA approved imatinib. As a result, a once commonly fatal cancer now has a five-year survival rate of 95% (Druker, Guilhot, O'Brien, Gathmann, Kantarjian, Gattermann, Deininger, Silver, Goldman, Stone et al. 2006).
Achievements like this largely inspire today's high throughput studies of linking cancer drugs (known or in development) to specific genomic changes which could be used as therapeutic biomarkers. The hope is that such analyses will shed light on biological mechanisms underlying drug sensitivity, tumor resistance and potential drug combination synergies. Such discoveries not only motivate development of precision cancer therapies, improvements of current therapies, and explorations of new therapeutic avenues, they also guide early developments of new cancer drugs (Garnett, Edelman, Heidorn, Greenman, Dastur, Lau, Greninger, Thompson, Luo, Soares et al. 2012; Barretina, Caponigro, Stransky, Venkatesan, Margolin, Kim, Wilson, Lehar, Kryukov, Sonkin et al. 2012).
Cancer cell lines have frequently been used as a convenient way of conducting such studies. For a systematic search of therapeutic biomarkers to a variety of cancer drugs, Garnett et al. (2012) screened 639 human tumor cell lines which represent much of the tissue-type and genetic diversity of human cancers with 130 drugs. These drugs, including approved drugs, drugs in development as well as experimental tool compounds, cover a wide range of targets and processes involved in cancer biology. In the Garnett et al. work, a range of 275-507 cell lines were screened for each drug. The effect of a 72 h drug treatment on cell viability was used to derive measures of drug sensitivity, including the half-maximal inhibitory concentration (IC50). In addition, the cell lines underwent sequencing of 64 known cancer genes, genome-wide analysis of copy number gains and losses, and expression profiling of 14,500 genes.
In analyzing these data, Garnett et al. used multivariate analysis of variance (MANOVA) to examine pairwise drug-(genomic) feature associations. This process largely limits knowledge discovery in that: 1) only categorical features can be used, and as a result over 95% of the features were not examined; and 2) the drug-feature associations can change dramatically when other features' impacts are considered, hence the marginal associations rarely reflect true relationships. It is more likely that sensitivity of cancer cells to drugs depends on a multiplicity of genomic and epigenomic features with potential interactions. Therefore, linear regression with an elastic-net penalty (Zou and Hastie 2005) was further applied to each drug to identify significantly related features while adjusting for other effects.
However, given the degree of complexity of the data, this latter elastic-net regression, applied separately to each drug, was far from sufficient. First, the 639 tumor cell lines came from a variety of cancer tissue types, hence there is likely additional heterogeneity manifested as subpopulations in the data. Second, the 130 drugs are hardly independent of each other. It has long been recognized that one can improve prediction performance by modeling multiple response variables jointly (Breiman and Friedman 1997).
As a result of the foregoing shortcomings in the state of the art, the present application addresses some direct questions of interest. These include, i) can cancer-specific therapeutic biomarkers be detected, ii) can drug resistance patterns be identified along with predictive strategies to circumvent resistance using alternate drugs, and iii) can biomarkers of combination therapies be identified to help predict synergies in drug activities?
Addressing these questions raises a number of statistical challenges: i) therapeutic biomarkers of drug sensitivity cluster among the cell lines; ii) clusters can overlap (e.g., a cell line may belong to multiple clusters); and iii) a multivariate framework is needed so that drugs can be modeled jointly.
The present application describes new statistical modeling techniques that address these issues using a finite mixture of multivariate regression (FMMR) model generalized to enable overlapping clustering and the numerical solutions of drug data, genomic data, biomarker data, demographic data, pathology data (e.g., cell line data, etc.).
In accordance with an example, a method of identifying alternative drug therapies, the method comprises: receiving, at a processing unit, drug responsiveness data for a treatment drug applied to a plurality of different cell lines, each cell line differing in genomic profile from one another such that the genomic profiles for the plurality of different cell lines are mutually exclusive, the drug responsiveness data being determined from live cell testing using the treatment drug; identifying at least two cell lines from the plurality of different cell lines, and determining, at the processing unit, an overlap cell line cluster where one of the at least two cell lines has a high drug responsiveness value and another of the at least two cell lines has a low drug responsiveness value, and wherein the overlap cell line cluster has a drug responsiveness value for the treatment drug that is between the high drug responsiveness value and the low drug responsiveness value, wherein determining the overlap cell line cluster comprises using a generalized finite mixture of multivariate regression (FMMR) model; performing, at the processing unit, an optimality algorithm process on data for a cluster of drugs to identify one or more alternative drugs exhibiting a drug responsiveness value between the high drug responsiveness value and the low drug responsiveness value; and administering to cells of one of the at least two cell lines the one or more alternative drugs.
In accordance with another example, a method of grouping drugs for treating a pathology, the method comprises: receiving, at a processing unit, drug responsiveness data for a set of drugs and for a set of cell lines; grouping, using the processing unit, each of the drugs into a different drug grouping, using (i) a generalized finite mixture of multivariate regression (FMMR) model as follows
wherein f(yi|xi, Bl
The figures described below depict various aspects of the system and methods disclosed herein. It should be understood that each figure depicts an embodiment of a particular aspect of the disclosed system and methods, and that each of the figures is intended to accord with a possible embodiment thereof. Further, wherever possible, the following description refers to the reference numerals included in the following figures, in which features depicted in multiple figures are designated with consistent reference numerals.
The present techniques provide computer-implemented processes for precision therapeutic biomarker screening for cancer. The techniques use a model-based overlapping clustering framework to assess large numbers of possible drugs and drug combinations against patient data, including cell line responsiveness, genetic data, phenotype data, demographic data, etc. A multivariate regression model has been developed, along with a latent overlapping cluster indicator variable. The techniques employ a new finite mixture of multivariate regression (FMMR) model and EM algorithm for modeling. The techniques have been used to analyze large amounts of drug data and have identified complex overlapping drug clusters, as well as cluster-wise drivers that facilitate identification of new drugs for treating pathologies, such as cancer, in patients.
Clustering is a well-established statistical technique which involves grouping data elements according to a measure of similarity. Traditional clustering techniques generate partitions so that each data element belongs to one and only one cluster. It has long been recognized that such an ideal clustering seldom exists in real data. It is more likely that clusters overlap in some parts.
Some statistical methods have been developed to handle the overlapping clustering problem. Lazzeroni and Owen (2002), for example, put forward the so-called plaid model for two-sided overlapping clustering of gene expression data. However, the numerical solutions proposed in the paper can produce unsatisfactory cluster retrieval results (see Turner, Bailey and Krzanowski (2005) and Turner, Bailey, Krzanowski and Hemingway (2005) for improved and extended (to structured data) plaid models, and see Zhang (2010) for the Bayesian plaid model formulation). Segal, Battle and Koller (2002) introduced a probabilistic model for one-sided overlapping clustering. Banerjee, Krumpelman, Ghosh, Basu and Mooney (2005) later generalized this probabilistic model to any regular exponential family distribution. However, both methods suffer from computational complexity in estimating a binary membership matrix $M \in \{0, 1\}^{n \times K}$, which is an integer optimization problem.
An alternative “naive” overlapping clustering method is the finite mixture (FM) model with a hard threshold α on cluster membership probabilities. This approach assigns a data point to a cluster k if $P(z_{ik} = 1 \mid X_i, \Theta) > \alpha$ and hence enables an item to belong to multiple clusters.
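The thresholding rule can be illustrated with a minimal sketch. It assumes the posterior membership probabilities have already been computed by some fitted mixture model; the function name `threshold_memberships` and the example numbers are hypothetical.

```python
import numpy as np

def threshold_memberships(posterior, alpha):
    """Assign item i to cluster k whenever P(z_ik = 1 | X_i, Theta) > alpha.

    posterior: (n, K) array of membership probabilities (each row sums to 1).
    Returns a boolean (n, K) matrix; a row may contain several True entries.
    """
    return np.asarray(posterior) > alpha

# Hypothetical posterior probabilities for two items and three clusters.
posterior = np.array([[0.55, 0.40, 0.05],
                      [0.90, 0.05, 0.05]])
members = threshold_memberships(posterior, alpha=0.3)
print(members.tolist())  # [[True, True, False], [True, False, False]]
```

The first item lands in two clusters, while a larger α would leave it in only one, or in none once α exceeds its largest probability.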
As pointed out by Banerjee et al. (2005), this conventional FM method is problematic because: 1) it is hard to choose the value of α; and 2) it is not a natural generative model for overlapping clustering since one underlying assumption of the mixture model is that each object comes from only one mixture component. Later Fu and Banerjee (2008) introduced an overlapping clustering approach based on the multiplicative mixture model (Heller and Ghahramani 2007). However, this model ended up with the same computational issue as in Banerjee et al. (2005).
Some have focused on investigating the relationship between response variables $Y \in \mathbb{R}^q$ and covariates $X \in \mathbb{R}^p$ by fitting a regression model rather than exploring X or Y on its own. In the context of regression analysis for q=1, finite mixture of regression (FMR) models are commonly used to capture unobserved cross-sectional heterogeneity in the data (see, e.g., Jedidi, Ramaswamy, DeSarbo and Wedel 1996). The FMR model postulates that a sample of observations comes from a finite mixture of latent sub-populations, with each sub-population represented by a specific regression model.
In 1972, Quandt proposed a two-component mixture of univariate Gaussian linear regression models, with the idea of maximizing the likelihood function to estimate model parameters. Subsequently, the well-known expectation-maximization (EM) algorithm was used for maximum likelihood estimation (MLE) with incomplete data, revolutionizing studies of the FMR model, with various researchers contributing further advances.
However, looking at each of these conventional techniques, the results are insufficient for true therapeutic applications, especially as data sets grow in size and complexity. Notably, none of these conventional techniques offers a truly generalized model, one that provides for overlapping (i.e., clustering) data sets for optimized variable selection. Data distributions depart from expected normality. In terms of statistical accuracy, breakdowns occur from forced approximations, such as setting mixture components as a constant. But most importantly, none of the above listed methods allows for simultaneous variable selection. In order to do variable selection, these data models were constrained to the point where response variables were treated as independent from one another. Yet clinical testing shows, especially for pathologies like various cancers, that selection variables can be interdependent. Therefore, these proposed techniques are not suitable for cancer drug screening.
Overcoming these and other limitations of conventional techniques, with the advanced techniques described herein, we are able to show that biomarkers indicative of a first pathology, say soft tissue cancer, may well have demonstrated drug treatments that alone or combined with other drug treatments may be used to treat a second pathology, such as cancer in the bone or kidney.
The present techniques provide, among other things, a computer-implemented process using a model-based overlapping clustering framework. As detailed in examples below, a multivariate regression model is provided and a latent overlapping cluster indicator variable is introduced. The present techniques employ a new finite mixture of multivariate regression (FMMR) model and an expectation-maximization (EM) algorithm to fit the new model. Penalized likelihood with a lasso, Elastic-Net, or other penalty function may be used to estimate model parameters while allowing variable selection. Simulation studies show excellent finite-sample performance that outperforms other methods. Analysis of drug data, for example, has found complex overlapping drug clusters as well as cluster-wise drivers that facilitate identification of new drugs for treating pathologies such as cancer.
As we discuss, the present techniques are able to assess multiple variables across cancer cell line targets to record drugs and drug combinations for treatment.
For a sample (of cell lines) of n observations, denote yi=(yi1, . . . , yiq)T∈q a vector of responses (IC50 values), xi=(xi1, . . . , xip
for i=1, . . . , n. The following allows overlapping clustering of cell lines within the context of a multivariate regression model,
where K is the total number of clusters, $B_k$ is a $p_n \times q$ coefficient matrix for the kth cluster, and $P_{ik} \in \{0, 1\}$ equals 1 if observation i belongs to the kth cluster and 0 otherwise. A traditional clustering problem assumes that each observation belongs to exactly one cluster, namely $\sum_{k=1}^{K} P_{ik} = 1$ for all i. Here we instead allow $\sum_{k=1}^{K} P_{ik} \ge 1$ so that each observation can belong to multiple clusters.
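One way to write model (2.1) as a sum of cluster-wise regression layers is sketched here. The form is an assumption, chosen to agree with the conditional distribution in (3.1), with the layer decomposition described for the generalized plaid model, and with the $N_q(0, \Sigma)$ errors used in the simulations; the symbol $\varepsilon_i$ for the error vector is illustrative.

```latex
y_i \;=\; \sum_{k=1}^{K} P_{ik}\, B_k^{T} x_i \;+\; \varepsilon_i ,
\qquad \varepsilon_i \sim N_q(0, \Sigma), \qquad i = 1, \dots, n .
```

Under this form, an observation with a single nonzero indicator $P_{ik} = 1$ has mean $B_k^{T} x_i$, while an observation in several clusters has a mean that adds the contributions of each of its clusters.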
We provide some interpretation of the clusters in model (2.1). A cluster k in the model contains the subset of observations for which $P_{ik} = 1$. We will assume that not all genomic features are relevant in describing the cluster of observations and thus will assume a sparse coefficient matrix $B_k$. Since the sparsity pattern can vary with k, each cluster can be represented by a unique set of biomarkers. For samples belonging to multiple clusters, the response variables are explained by multiple sets of biomarkers, indicating that several biological processes are involved simultaneously.
We examined two approaches for fitting (2.1): a generalized finite mixture of multivariate regressions model and a generalization of the plaid model previously discussed.
Finite Mixture of Multivariate Linear Regression
When $\sum_{k=1}^{K} P_{ik} = 1$, model (2.1) can be characterized by a hierarchical structure that leads to an FMMR model. Consider a latent cluster membership random variable $z_i$ for observation i. Given $\sum_{k=1}^{K} P_{ik} = 1$, the range of $z_i$ is {1, . . . , K}. Assume that $P(z_i = k) = \pi_k$ for each k, so that $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \ge 0$. Moreover, under model (2.1),
$(y_i \mid z_i = k, x_i, B_k, \Sigma) \sim N_q(B_k^{T} x_i, \Sigma)$. (3.1)
Denote Θ=(B1, . . . , BK, Σ, π) with π=(π1, . . . , πK-1)T. We can derive the joint density of yi and zi as
where f(yi|xi, Bk, Σ, π) is a multivariate normal density function defined in (3.1).
Summing (3.2) over $z_i$ = 1, . . . , K yields the FMMR model
When using the EM algorithm to estimate model (3.3), $(x_i, y_i)$ is regarded as an incomplete observation because the cluster membership variable $z_i$ is missing; one instead works with the joint density (3.2) of the complete data. The complete (conditional) log-likelihood of Θ given a sample of n independent observations from model (2.1) becomes
where zik=I(zi=k).
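The mixture density obtained by summing over the latent membership can be evaluated numerically as in the following minimal sketch. It assumes the joint density of $(y_i, z_i = k)$ is the product of $\pi_k$ and the normal density of (3.1); the function names `mvn_logpdf` and `fmmr_loglik` are illustrative and not taken from the source.

```python
import numpy as np

def mvn_logpdf(Y, mean, Sigma):
    """Row-wise log density of N_q(mean_i, Sigma); Y and mean have shape (n, q)."""
    q = Y.shape[1]
    resid = Y - mean
    _, logdet = np.linalg.slogdet(Sigma)
    # Mahalanobis distance per row, using a linear solve instead of an explicit inverse.
    maha = np.einsum("ij,ij->i", resid, np.linalg.solve(Sigma, resid.T).T)
    return -0.5 * (q * np.log(2.0 * np.pi) + logdet + maha)

def fmmr_loglik(X, Y, Bs, Sigma, pi):
    """Observed-data log-likelihood: sum_i log sum_k pi_k f(y_i | x_i, B_k, Sigma).

    X: (n, p) covariates, Y: (n, q) responses, Bs: list of K (p, q) matrices,
    pi: length-K mixing proportions (assumed strictly positive).
    """
    # log(pi_k) + log f(y_i | x_i, B_k, Sigma) for every i and k, shape (n, K).
    log_joint = np.column_stack(
        [np.log(p_k) + mvn_logpdf(Y, X @ B, Sigma) for p_k, B in zip(pi, Bs)]
    )
    # Log-sum-exp over k for numerical stability.
    m = log_joint.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))))
```

The log-sum-exp step matters in practice because individual component densities can underflow when q is large.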
Variable selection, to obtain a parsimonious model, is essential, especially in high-dimensional statistical modeling, for enhanced scientific discovery and improved prediction accuracy (Fan and Li 2006). Khalili and Chen (2012) introduced a penalized likelihood approach for variable selection in FMR models, which was shown to be variable selection consistent and computationally much more efficient than all-subset selection. Similarly, we derived a penalized log-likelihood of Θ as
where
The following two penalty functions were used to meet different demands in variable selection:
1) the L1-penalty in LASSO (Tibshirani 1996) for simultaneous estimation and variable selection
$\rho_{nk}(B_k) = \lambda_k \|B_k\|_1$
2) a linear combination of the L1- and L2-penalty in Elastic-Net (Zou and Hastie, 2005) for simultaneous estimation and selection of grouped features
$\rho_{nk}(B_k) = \lambda_{k1} \|B_k\|_1 + \lambda_{k2} \|B_k\|_2^2$,
where $\|\cdot\|_1$ and $\|\cdot\|_2$ are respectively the L1- and L2-norms and the tuning parameters satisfy $\lambda_k, \lambda_{k1}, \lambda_{k2} \ge 0$.
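The two penalties can be computed as in the sketch below. It assumes the norms are taken entry-wise over the coefficient matrix (so the squared L2-norm is the sum of squared entries); the function names are illustrative.

```python
import numpy as np

def lasso_penalty(B, lam):
    """L1 penalty: lam times the sum of absolute entries of B."""
    return lam * np.abs(B).sum()

def elastic_net_penalty(B, lam1, lam2):
    """Elastic-Net penalty: lam1 * ||B||_1 + lam2 * ||B||_2^2 (entry-wise norms)."""
    return lam1 * np.abs(B).sum() + lam2 * np.square(B).sum()

# Small illustrative coefficient matrix with one zero entry.
B = np.array([[1.0, -2.0],
              [0.0, 3.0]])
print(lasso_penalty(B, 0.5))             # 3.0
print(elastic_net_penalty(B, 1.0, 0.5))  # 13.0
```

The squared term adds no sparsity by itself; it is the L1 term that sets coefficients exactly to zero, while the L2 term encourages correlated features to enter together.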
In the next section, a revised FMMR model and EM algorithm are devised for overlapping clustering of multivariate regression data.
Generalized FMMR Model and EM Algorithm for Overlapping Clustering
The FMMR model solves (2.1) given $\sum_{k=1}^{K} P_{ik} = 1$ for each i, wherein the clusters are partitional. When the clusters are overlapping, namely $\sum_{k=1}^{K} P_{ik} \ge 1$, we propose a generalized FMMR model to address the issue; this generalized model preserves the ability to fit partitions.
Suppose we have K objective clusters indexed by 1 to K. Note that these clusters can overlap with each other, resulting in $2^K - 1$ types of overlapping patterns. Taking K = 3 as an example, the overlapping patterns are S = {1, 2, 3, 12, 13, 23, 123}. Here element 12 in S represents exclusively the overlapping part of objective clusters 1 and 2, whereas element 1 represents the subset of objective cluster 1 that does not overlap with any other objective cluster.
We define $2^K - 1$ hypothetical clusters. Each hypothetical cluster (called “cluster” herein) indicates an overlapping pattern defined above and is indexed by an element in T, wherein
$T = \bigcup_{s=1}^{K} \{(l_1 \dots l_s) : \{l_1, \dots, l_s\} \subseteq \{1, \dots, K\}\}$.
Cluster $(l_1 \dots l_s)$ implies that its members belong to objective clusters $l_1, \dots, l_s$.
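The index set T can be enumerated directly, as in this short sketch. Representing each hypothetical cluster as a sorted tuple of objective cluster labels is an illustrative choice, and the function name is hypothetical.

```python
from itertools import combinations

def hypothetical_clusters(K):
    """All non-empty subsets of {1, ..., K} as sorted tuples, ordered by subset size."""
    return [combo
            for s in range(1, K + 1)
            for combo in combinations(range(1, K + 1), s)]

T = hypothetical_clusters(3)
print(T)       # [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
print(len(T))  # 7
```

Because the number of hypothetical clusters grows as $2^K - 1$, this enumeration is practical only for a small number of objective clusters.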
We introduce a latent cluster membership random variable zi for observation i from (2.1), and characterize model (2.1) by a hierarchical structure. Given Σk=1K Pik≥1, the range of zi becomes the set T. We further postulate that
P(zi=l1 . . . ls))=π(l
By properties of probability, vector π=(π(l
Moreover for (2.1), it amounts to Pil
Let Θ=(B1, . . . , BK, Σ, π). By (3.6) and (3.7), the joint density of yi and zi equals
where f(yi|xi, Bl
By summing (3.8) over zi, we obtain a generalized FMMR model
Note that if π(l
For a sample of n independent observations from model (2.1), the complete (conditional) log-likelihood of Θ becomes
where zi,l(l
We derived a penalized log-likelihood of Θ which enforces sparsity in genomic features.
where $\rho_{nk}(B_k)$ utilizes one of two penalty functions, the LASSO penalty ($\rho_{nk}(B_k) = \lambda_k \|B_k\|_1$) or the Elastic-Net penalty
for grouped features as defined herein. The notation
denotes summation over all $(l_1 \dots l_s) \in T$ for which $k \in \{l_1, \dots, l_s\}$. In (3.10), $B_k$ is a coefficient matrix corresponding to the kth objective cluster. The penalty imposed on $B_k$ is proportional to
which is the proportion of observations involved in the kth objective cluster. This strategy of relating the penalty to sample sizes is similar to that of Khalili and Chen (2012) and is intended to enhance the power of the method.
Numerical Solution to the Generalized FMMR Model
A new expectation maximization (EM) algorithm to maximize the penalized log-likelihood in (
E-step: Given Θm, estimate i,(l
for i=1, . . . , n and (l1 . . . ls)ϵT.
M-step: Given Zm=(i,(l
for $(l_1 \dots l_s) \in T$. Note that, for simplicity, this update is obtained by maximizing only the leading term of (3.10) with respect to π. This simplified updating scheme nevertheless works well in simulation studies.
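The E-step and the simplified update of the mixing proportions can be sketched as follows. Two assumptions are made: the conditional mean of an observation in hypothetical cluster $(l_1 \dots l_s)$ is the sum of $B_k^{T} x_i$ over its objective clusters with a common covariance Σ, and the posterior membership probabilities are proportional to the mixing proportion times the normal density. The names `e_step` and `update_pi` are illustrative.

```python
import numpy as np
from itertools import combinations

def mvn_logpdf(Y, mean, Sigma):
    """Row-wise log density of N_q(mean_i, Sigma); Y and mean have shape (n, q)."""
    q = Y.shape[1]
    resid = Y - mean
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("ij,ij->i", resid, np.linalg.solve(Sigma, resid.T).T)
    return -0.5 * (q * np.log(2.0 * np.pi) + logdet + maha)

def e_step(X, Y, Bs, Sigma, pi):
    """Posterior membership probabilities over all hypothetical clusters.

    Bs: list of K (p, q) matrices; pi: dict mapping a tuple such as (1, 3)
    to its mixing proportion (assumed strictly positive).
    Returns (T, Z), where Z[i, j] is the posterior probability of T[j] for item i.
    """
    K = len(Bs)
    T = [c for s in range(1, K + 1) for c in combinations(range(1, K + 1), s)]
    # Assumed mean for cluster c: X times the sum of B_k over k in c.
    log_w = np.column_stack([
        np.log(pi[c]) + mvn_logpdf(Y, X @ sum(Bs[k - 1] for k in c), Sigma)
        for c in T
    ])
    log_w -= log_w.max(axis=1, keepdims=True)  # stabilize before exponentiating
    Z = np.exp(log_w)
    Z /= Z.sum(axis=1, keepdims=True)
    return T, Z

def update_pi(T, Z):
    """Simplified M-step for pi: average posterior probability of each cluster."""
    return {c: float(Z[:, j].mean()) for j, c in enumerate(T)}
```

Averaging the posterior probabilities is the closed-form maximizer of the unpenalized complete-data term in π, which is why it is a natural reading of the simplified update described above.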
Given Zm, πm+1 and Σm, sequentially update Bkm+1|{Bs,s≠km} by
for k=1, . . . , K.
According to (3.7), yi for i=1, . . . , n are i.i.d. in multivariate normal. Hence (3.11) becomes
Optimization of (3.12) is a multivariate regression problem with a LASSO penalty for estimation, which can be solved by the MRCE algorithm (Rothman, Levina and Zhu 2010). If we ignore the covariance structure of Σm in (3.12), the optimization reduces to q independent LASSO regression problems. In this case, more complex penalty functions, such as the Elastic-Net or the fused LASSO penalty, can be easily applied. Therefore we also investigated the performance of this simplified strategy in simulation studies, and found that it works surprisingly well, comparable in terms of clustering accuracy to the original method utilizing the MRCE algorithm.
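The simplified strategy of solving one LASSO problem per response column can be sketched with a small coordinate-descent solver. This is not the MRCE algorithm, and the objective is an assumption: for each column, a weighted least-squares loss $\frac{1}{2}\sum_i w_i (r_i - x_i^T b)^2$ plus $\lambda \|b\|_1$, where the weights $w_i$ could play the role of posterior membership probabilities and $r_i$ the working response. All function names are illustrative.

```python
import numpy as np

def soft_threshold(a, lam):
    """Soft-thresholding operator used by LASSO coordinate descent."""
    return np.sign(a) * max(abs(a) - lam, 0.0)

def weighted_lasso(X, r, w, lam, n_iter=200):
    """Minimize 0.5 * sum_i w_i (r_i - x_i^T b)^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    denom = (w[:, None] * X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of coordinate j added back in.
            partial = r - X @ b + X[:, j] * b[j]
            rho = np.sum(w * X[:, j] * partial)
            b[j] = soft_threshold(rho, lam) / denom[j] if denom[j] > 0 else 0.0
    return b

def separate_lasso(X, R, w, lam):
    """Fit each of the q response columns independently, ignoring their covariance."""
    return np.column_stack([weighted_lasso(X, R[:, j], w, lam) for j in range(R.shape[1])])
```

With an orthonormal design and unit weights, each coefficient is simply the soft-thresholded correlation between its column and the response, which gives a quick sanity check of the solver.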
By taking the derivative of (3.13) with respect to $\Sigma^{-1}$, we get
where Σ(l
Although we can also penalize the inverse covariance matrix $\Sigma^{-1}$ in (3.13) as in Rothman et al. (2010), our simulations have shown that penalizing $\Sigma^{-1}$ results in lower clustering accuracy than leaving it unpenalized. This is because penalizing $\Sigma^{-1}$ introduces bias in its estimation, and a biased estimate of $\Sigma^{-1}$ in turn degrades the estimate of $z_{i,(l_1 \dots l_s)}$ in the E-step. Thus we choose not to penalize $\Sigma^{-1}$.
Commencing with an initial value $\Theta^{0}$, the present models may iterate between the E- and M-steps until the relative change in log-likelihood, $|(l_n^{m+1}(\hat\Theta) - l_n^{m}(\hat\Theta)) / l_n^{m}(\hat\Theta)|$, is smaller than some threshold value, taken as $10^{-5}$ in simulation studies and $10^{-3}$ in real data analysis. Additionally, a cluster whose mixing proportion is smaller than some threshold value, taken as 0.01 herein, was removed during iterations to avoid over-estimation.
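The stopping rule and the removal of small clusters can be sketched as two helper functions. The renormalization of the remaining mixing proportions after pruning is an assumption, since the text only states that small clusters are removed; the function names are illustrative.

```python
def has_converged(loglik_new, loglik_old, tol=1e-5):
    """True when the relative change |(l_new - l_old) / l_old| falls below tol."""
    return abs((loglik_new - loglik_old) / loglik_old) < tol

def prune_clusters(pi, min_prop=0.01):
    """Drop clusters with mixing proportion below min_prop, then renormalize (assumed)."""
    kept = {c: p for c, p in pi.items() if p >= min_prop}
    total = sum(kept.values())
    return {c: p / total for c, p in kept.items()}

# Hypothetical mixing proportions for K = 2 objective clusters.
pi = {(1,): 0.6, (2,): 0.395, (1, 2): 0.005}
pruned = prune_clusters(pi)
print(sorted(pruned))  # [(1,), (2,)]
```

A tolerance of 1e-5 corresponds to the simulation setting and 1e-3 to the real data setting quoted above.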
Selection of the Tuning Parameters and K
In the preceding penalized likelihood approach, one must choose the component-wise tuning parameters $\lambda_k$ for k=1, . . . , K, which control the complexity of the estimated model (Fan and Lv 2010). Data-driven cross-validation (CV) (Stone 1977) is frequently adopted in the literature (Tibshirani 1996; Zou and Hastie 2005). In an example, we used a component-wise 10-fold CV method for optimal tuning parameter selection in expression (3.12).
Selection of the number of components K is required in finite mixture models (including the FM, FMR and FMMR). In applications, the choice can be based on the prior knowledge of data analysts. With respect to formal methodologies, information criteria (IC) remain by far the most popular strategy for selection of K. See Fraley and Raftery (1998), Anderson and Burnham (2002) and Claeskens, Hjort et al. (2008) for general treatments on this topic. Here we choose the K that minimizes the IC,
$IC_n(K) = -2\, l_n(\Theta) + N_K a_n$. (3.14)
In (3.14), $N_K$ is the effective number of parameters in the model with
$N_K = |\{B_k, k = 1, \dots, K\}| + |\pi| - 1 + |\Sigma|$,
where |A| denotes the number of nonzero elements in A. Also in (3.14), $a_n$ is a positive sequence depending on n. The well-known AIC (Akaike 1974) and BIC (Schwarz et al. 1978) correspond to $a_n = 2$ and $a_n = \log(n)$ respectively. It has been shown (in Keribin (2000)) that under general regularity conditions, BIC can correctly identify the order of an FM model as n increases to infinity. Whether these results extend to the FMR or FMMR models is, however, unknown. Nevertheless, we have examined the performance of BIC for selection of K in a generalized FMMR model via simulations in supplementary material (Exhibit A, attached herewith). It obtains a high degree of accuracy in identifying the true K in our new model.
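The criterion in (3.14) can be computed as in this sketch, which counts nonzero entries literally as defined above (so both symmetric off-diagonal entries of Σ are counted). The function names and the example values are hypothetical.

```python
import numpy as np

def effective_num_params(Bs, pi, Sigma):
    """N_K = nonzero entries of all B_k, plus |pi| - 1, plus nonzero entries of Sigma."""
    nonzero_B = sum(int(np.count_nonzero(B)) for B in Bs)
    return nonzero_B + int(np.count_nonzero(pi)) - 1 + int(np.count_nonzero(Sigma))

def information_criterion(loglik, Bs, pi, Sigma, n, kind="bic"):
    """IC_n(K) = -2 * loglik + N_K * a_n, with a_n = log(n) for BIC and 2 for AIC."""
    a_n = np.log(n) if kind == "bic" else 2.0
    return -2.0 * loglik + effective_num_params(Bs, pi, Sigma) * a_n

# Hypothetical fitted values for K = 2 components with p = q = 2.
Bs = [np.array([[1.0, 0.0], [0.0, 2.0]]), np.zeros((2, 2))]
pi = np.array([0.5, 0.5])
Sigma = np.eye(2)
print(effective_num_params(Bs, pi, Sigma))                            # 5
print(information_criterion(-10.0, Bs, pi, Sigma, n=100, kind="aic"))  # 30.0
```

In use, the model would be refitted for each candidate K and the K with the smallest criterion value retained.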
Generalization of the Plaid Model
To compare with the generalized FMMR model introduced herein, we generalize the plaid model by introducing covariates so that it too can cluster overlapping multivariate regression data. To the best of our knowledge, no conventional plaid model technique has been applied to overlapping clustering of (univariate or multivariate) regression data. The plaid model and the probabilistic model are the two most popular methods for overlapping clustering of ordinary data. We select the plaid model rather than the probabilistic model for comparison because the latter, like ours, revolves around a Bayesian model. The plaid model, however, introduced the concept of layers (subsets of rows and columns of a data matrix), wherein a data matrix is decomposed as the sum of the layers. Model (2.1) is the generalized plaid model. It seeks the parameter estimators that minimize the objective Q
where pnk (Bk) is the penalty function defined herein.
Detailed procedures for optimizing Q are shown in Algorithm 1 in
Asymptotic Properties and Some Simulation Studies
Asymptotic Properties
Khalili and Lin (2013) studied the penalized likelihood approach for variable selection and estimation in finite mixture of regression models with a diverging $p_n$. It was shown that this approach leads to consistent estimation as well as variable selection under certain regularity conditions. Let f(w; Θ) be the joint density function of w=(x, y) with parameter space $\Theta \in \Omega$. The conditional density function of y given x follows (3.9). Since the regularity conditions given for asymptotic establishments in Khalili and Lin (2013) are made on pnk(Bk) and f(w; Θ) directly, the corresponding theoretical properties can be extended to the current problem.
Denote by $B_{kj}$ the jth column of $B_k$ for j=1, . . . , q. Let $B_{kj} = (B_{k1j}, B_{k2j})$ divide the coefficient vector into non-zero and zero subsets. Denote by $\Theta_0$ the true value of Θ; $\Theta_0 = (\Theta_{10}, \Theta_{20})$ is the corresponding decomposition such that $\Theta_{20}$ contains all zero coefficients, i.e. $B_{k2j}$, and $\hat\Theta_n = (\hat\Theta_{n1}, \hat\Theta_{n2})$ is the estimate. Since the dimension of $\hat\Theta_{n1}$ increases with n as $p_n$ diverges, we investigate the asymptotic distribution of a finite linear transformation, $D_n \hat\Theta_{n1}$, where $D_n$ is a constant $l \times d_{n1}$ matrix with a finite l, and $d_{n1}$ is the dimension of $\Theta_{10}$. Moreover $D_n D_n^{T} \to D$, where D is a positive definite symmetric matrix. In the theory, we assume that the order K is independent of the sample size n and also is known beforehand. Strategies for its selection in applications are discussed herein and performance of the selection is studied in simulations.
Theorem 1. Suppose the penalty function pnk(Bk) satisfies conditions P0-P2, and the joint density function f(w; Θ) of a random sample w=(x; y) satisfies conditions R1-R5.
I. If
then there exists a local maximizer $\hat\Theta_n$ for the penalized log-likelihood $\tilde{l}_n(\Theta)$ with
where
II. If ρnk(Bk) also satisfies condition 3 and
for any $\sqrt{n/p_n}$-consistent maximum penalized log-likelihood estimator $\hat\Theta_n$, as $n \to \infty$ it has:
1. Variable selection consistency:
$P(\hat{B}_{k2j} = 0) \to 1$, k=1, . . . ,K and j=1, . . . ,q.
2. Asymptotic normality:
where $I(\Theta_{10})$ is the Fisher information matrix under the true subset model.
Proof. Write Θ=(θ1, θ2, . . . , θt
1) Let
For any given ε>0, there exists a constant Cε such that
2) For any $\Theta = (\Theta_1, \Theta_2)$ in the neighborhood $\|\Theta - \Theta_0\| = O(\sqrt{p_n/n})$, it has that
By (5.2), the left side of (5.1), denoted as A, becomes
where
i=1, . . . , n, are independent and identically distributed random vectors. They satisfy the condition of the Lindeberg-Feller central limit theorem and also
Theorem 1 can be easily derived from the above conclusions. We refer readers to Khalili and Lin (2013) for detailed proofs.
By Theorem 1.I, $\hat\Theta_n$ with respect to the LASSO or Elastic Net penalty has convergence rate $\sqrt{p_n/n}$ under an appropriate choice of tuning parameters. Note that a consistent estimator of $\Theta_0$ does not necessarily guarantee variable selection consistency. Theorem 1.II looks into conditions under which the estimator is consistent in variable selection. However, the LASSO or Elastic Net penalty cannot attain both properties simultaneously because the bias term q2n is proportional to $\lambda_k/\sqrt{n}$, while $\lambda_k$ must be large enough to achieve sparsity. This problem can be efficiently solved by using the adaptive LASSO or adaptive Elastic Net penalty instead, which fulfills the conditions required for both variable selection and estimation consistency.
Design of the Simulations
Two scenarios are designed for the non-overlapping and overlapping situations respectively. In each scenario, data are generated from model (2.1) with K=3, $p_n$=15, q=3 and n=150 or 450. Predictors $x_i$ are IID from $N_{p_n}(0, \Sigma_1)$ and random errors $\varepsilon_i$ are IID from $N_q(0, \Sigma)$ with $\Sigma_1(i, j) = 0.5^{|i-j|}$ and $\Sigma(i, j) = 0.75^{|i-j|}$.
The sparse coefficient matrices $B_k$, k=1, 2, 3 are generated as follows,
$B_k = W \otimes S \otimes T$, (5.3)
where ⊗ indicates the element-wise product. Each entry of W is drawn independently from N(0, 1), each entry of S independently from the Bernoulli distribution $B(1, p_1)$, and rows of T (either all 1 or all 0) are determined by $p_n$ independent Bernoulli draws from $B(1, p_2)$. Thus for each response variable, we expect $p_1 p_2 p_n$ relevant predictors, and $(1 - p_2) p_n$ predictors are expected to be irrelevant to all q responses. We set $p_1 = 0.5$ and $p_2 = 0.9$ for all scenarios. Simulations are repeated 50 times. For each repetition, a new set of $B_k$, k=1, 2, 3 is generated. To show that our proposed method has the ability to identify overlapping clusters, we assume K is known in the following simulations.
Scenario 1: The 3 clusters are non-overlapping, and each cluster contains n/3 observations.
Scenario 2: 70% of the observations belong to a single cluster, 22% of the observations belong to two clusters and 8% of the observations belong to three clusters.
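The data-generating design for Scenario 2 can be sketched as follows. Several points are assumptions: the responses are formed by summing the regression contributions of every cluster an observation belongs to (the sum-of-layers reading of model (2.1)), the specific clusters of each observation are chosen uniformly at random, and the function names and seed are illustrative.

```python
import numpy as np

def ar1_cov(dim, rho):
    """Covariance matrix with entries rho ** |i - j|."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def sparse_coef(p, q, p1, p2, rng):
    """B = W * S * T element-wise, with rows of T either all 1 or all 0."""
    W = rng.standard_normal((p, q))
    S = rng.binomial(1, p1, size=(p, q))
    T = np.repeat(rng.binomial(1, p2, size=(p, 1)), q, axis=1)
    return W * S * T

def simulate_scenario2(n=150, p=15, q=3, K=3, p1=0.5, p2=0.9, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(np.zeros(p), ar1_cov(p, 0.5), size=n)
    E = rng.multivariate_normal(np.zeros(q), ar1_cov(q, 0.75), size=n)
    Bs = [sparse_coef(p, q, p1, p2, rng) for _ in range(K)]
    # 70% of observations in one cluster, 22% in two, the remainder (8%) in three.
    n_one, n_two = round(0.70 * n), round(0.22 * n)
    sizes = rng.permutation(np.repeat([1, 2, 3], [n_one, n_two, n - n_one - n_two]))
    P = np.zeros((n, K), dtype=int)
    for i, s in enumerate(sizes):
        P[i, rng.choice(K, size=s, replace=False)] = 1  # assumed: uniform choice of clusters
    # Assumed model (2.1): sum of cluster-wise regressions plus correlated noise.
    Y = sum(P[:, [k]] * (X @ Bs[k]) for k in range(K)) + E
    return X, Y, P, Bs

X, Y, P, Bs = simulate_scenario2()
print(X.shape, Y.shape, P.sum(axis=0).shape)  # (150, 15) (150, 3) (3,)
```

Scenario 1 follows from the same sketch by assigning every observation to exactly one cluster, n/3 per cluster.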
Cluster Recovery Quality Measures
Quality measures in Baeza-Yates, Ribeiro-Neto et al. (1999) are used to evaluate the clustering performance of the proposed method. Suppose we want to compare a target cluster A and a retrieved cluster Â. Ns denotes the number of observations in set S. The quality measures are defined as
The F1 measure, taken as the harmonic mean of specificity and sensitivity, gives an overall measure of the clustering performance.
In order to compare a sequence of target clusters, we use a one-to-one correspondence match approach (Turner, Bailey and Krzanowski 2005). We first make the number of retrieved clusters the same as the number of target clusters by adding null clusters to the retrieved clusters or dropping additional poorly retrieved clusters. Retrieved clusters are then matched to target clusters via a one-to-one correspondence. Finally, the mean pair-wise quality measures are calculated. The optimal one-to-one match is the one producing the highest mean pair-wise F1 measure.
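The matching procedure can be sketched for a small number of clusters by trying every one-to-one assignment. The definitions used here are assumptions in the style of information-retrieval measures: specificity is taken as the number of shared observations divided by the size of the retrieved cluster, and sensitivity as the number of shared observations divided by the size of the target cluster. The function names are illustrative.

```python
from itertools import permutations

def f1_measure(target, retrieved):
    """Harmonic mean of the assumed specificity and sensitivity of a retrieved cluster."""
    target, retrieved = set(target), set(retrieved)
    common = len(target & retrieved)
    if common == 0:
        return 0.0  # also covers null (empty) retrieved clusters
    specificity = common / len(retrieved)
    sensitivity = common / len(target)
    return 2 * specificity * sensitivity / (specificity + sensitivity)

def best_match(targets, retrieved):
    """Exhaustive one-to-one match maximizing the mean pair-wise F1.

    Assumes both lists already have the same length (pad with empty sets beforehand).
    Returns (mean F1, permutation), where permutation[i] is the retrieved index
    matched to targets[i].
    """
    best_score, best_perm = -1.0, None
    for perm in permutations(range(len(retrieved))):
        score = sum(f1_measure(t, retrieved[j]) for t, j in zip(targets, perm)) / len(targets)
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm

targets = [{1, 2, 3}, {4, 5}]
retrieved = [{4, 5}, {1, 2}]
score, perm = best_match(targets, retrieved)
print(perm)  # (1, 0)
```

Exhaustive search over permutations is adequate for K = 3; a larger number of clusters would call for an assignment algorithm instead.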
Cluster Model Comparisons
Several methods were implemented in the simulations: the plaid model (plaid) and its revised counterparts (aplaid); the FMR model from the ‘flexmix’ R package (EM) in Leisch (2004) and Grun and Leisch (2008), where the multivariate responses are treated as independent and overlapping is not considered; our proposed generalized FMMR model with the multivariate responses treated as independent (gEMseplasso0); our proposed generalized FMMR model with separate lasso estimation for Bk in (3.12) (gEMseplasso); and finally our proposed original method (gEMmrce).
The clustering performances were evaluated via quality measures as well as the sum of squared estimation errors (SSE) for model coefficients. Results for the example simulations are summarized in
Therapeutic Biomarker Screening for the Sanger Dataset
We applied the techniques herein to analyze the Sanger high-throughput drug sensitivity dataset for cancer. An updated version of this data set (cancerrxgene.org/downloads/) contains 140 drugs as response variables, 13831 genomic features as covariates (including tissue type, rearrangements, mutation status of 71 cancer genes, continuous copy number data of 426 genes causally implicated in cancer, as well as genome-wide transcriptional profiles), and 707 human tumor cell lines as samples. The responses, made up of IC50 values from pairwise drug-cell-line screening, have some missing values. Therefore, the data were first filtered by removing cell lines for which fewer than 50% of the drugs were tested, leaving 591 cell lines. In the remaining data, about 37.4% of the cell lines have missing values. These missing values are imputed via the random forest imputation algorithm (Ishwaran, Kogalur, Blackstone and Lauer 2008; Stekhoven and Buhlmann 2012).
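The filtering step can be illustrated with a short sketch. It assumes the IC50 data are held as a mapping from cell line name to a list of per-drug values, with None marking an untested drug; this layout, the function name, and the toy values are hypothetical. The random forest imputation itself is not shown.

```python
def filter_cell_lines(ic50, min_tested=0.5):
    """Keep cell lines for which at least `min_tested` of the drugs were tested.

    `ic50` maps a cell line name to a list of IC50 values, one per drug,
    where None marks a missing (untested) drug.
    """
    kept = {}
    for cell_line, values in ic50.items():
        tested = sum(v is not None for v in values)
        if tested / len(values) >= min_tested:
            kept[cell_line] = values
    return kept

# Toy data: four drugs, three cell lines.
toy_ic50 = {
    "line_a": [0.8, None, 1.2, 0.3],    # 75% tested, kept
    "line_b": [None, None, None, 2.1],  # 25% tested, removed
    "line_c": [0.5, 0.9, None, None],   # exactly 50% tested, kept
}
filtered = filter_cell_lines(toy_ic50)
```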
In order to identify patterns of cancer-specific therapeutic biomarkers, we employed the gEMseplasso method in the analysis. Note that by “cancer-specific” we do not assume separate patterns for each cancer type but rather clusters that may be driven by one or a small number of cancer types. Throughout the analysis, we fixed K at 3, although, as previously shown, a BIC model selection approach could also be used. Fixing K still permits interesting findings and saves computational time.
Due to the scale and complexity of the data, the analysis in this example was conducted in three steps. Although simulation studies show that modeling with multivariate responses yields much higher clustering accuracy than modeling with a single response, it is unreasonable to simply fit all 140 drugs (responses) in one generalized FMMR model, because doing so assumes that the cell line assignments to components are the same for all 140 drugs, which can hardly be true. Thus, in our analysis, we first divide the 140 drugs into groups and then fit each drug group with a generalized FMMR model. This strategy leads to the following three steps.
In the first step, for example as implemented in a drug analysis computing system, the processing takes received drug data and fits each drug c by a generalized FMMR model. As a result, we obtain drug-specific cell line assignments, Ẑc, as well as component-wise coefficient estimates, B̂kc for k=1, . . . , 3.
In the second step, the process uses an affinity-propagation clustering (APC) algorithm (Frey and Dueck 2007) to group the 140 drugs based on results from the first step. The grouping is conducted in a two-level nested manner. In the first level, the APC algorithm is applied to all 140 drugs. The pair-wise similarity matrix required by the algorithm as input data is calculated from the Euclidean distance between Ẑa and Ẑb. In the second level, the APC algorithm is utilized again within every first-level drug group. The pair-wise similarity matrix is calculated from the Euclidean distance between B̂ka and B̂kb for k=1, . . . , 3. Consequently, drugs having similar cell line assignments and component-wise coefficient estimates are grouped together. Resulting drug groups are shown in
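The similarity input for the APC algorithm can be sketched as below. The text states only that the similarity is calculated from the Euclidean distance; this example assumes the negative squared Euclidean distance, which is the convention used by Frey and Dueck, applied to flattened estimates. The function name and the toy vectors are hypothetical.

```python
def similarity_matrix(vectors):
    """Pair-wise similarity as negative squared Euclidean distance (assumed form)."""
    n = len(vectors)
    return [
        [-sum((x - y) ** 2 for x, y in zip(vectors[a], vectors[b])) for b in range(n)]
        for a in range(n)
    ]

# Hypothetical flattened assignment estimates for three drugs
# (two cell lines, K=3 components each).
z_hat = {
    "drug_a": [1, 0, 0, 0, 1, 0],
    "drug_b": [1, 0, 0, 0, 1, 0],
    "drug_c": [0, 1, 1, 1, 0, 0],
}
drugs = sorted(z_hat)
S = similarity_matrix([z_hat[d] for d in drugs])
```

The resulting matrix could then be passed to any affinity-propagation implementation that accepts a precomputed similarity matrix. The same function applies at the second level, with flattened coefficient estimates in place of the assignment estimates and with the drugs restricted to one first-level group.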
Finally, in the third step of the process, each second-level drug group from step two is fitted by a generalized FMMR model. We get specific cell line assignments, ẐC, and component-wise coefficient estimates B̂kC, k=1, . . . , 3, for every second-level drug group C.
As an illustration, the membership and coefficient estimates of cell line components 1 and 12 for every second-level drug group are shown in
Next, we show how the present techniques can be used to identify predictive strategies for circumventing drug resistance based on mutation data.
Overlapping cell lines can be used to guide drug combinations and identify potential drug synergies. Based on estimates of cell line assignments ẐC, the process can cluster the IC50 values with respect to cell line components for each drug.
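One way to picture this grouping is to label each cell line by the set of components it belongs to, so that a cell line in components 1 and 2 falls in the overlapping component “12”. The sketch below assumes a binary membership row per cell line and a single IC50 value per cell line for the drug in question; the names and toy values are hypothetical.

```python
from collections import defaultdict

def component_label(membership_row):
    """Map a binary membership row, e.g. [1, 1, 0], to a label such as '12'.

    A row of all zeros yields the empty label ''.
    """
    return "".join(str(k + 1) for k, z in enumerate(membership_row) if z)

def group_ic50_by_component(memberships, ic50):
    """Group one drug's IC50 values by cell line component label."""
    groups = defaultdict(list)
    for cell_line, row in memberships.items():
        groups[component_label(row)].append(ic50[cell_line])
    return dict(groups)

# Toy assignments for four cell lines and their IC50 values for one drug.
memberships = {
    "line_a": [1, 0, 0],
    "line_b": [0, 1, 0],
    "line_c": [1, 1, 0],
    "line_d": [1, 1, 0],
}
ic50 = {"line_a": 2.0, "line_b": 1.5, "line_c": 0.4, "line_d": 0.6}
grouped = group_ic50_by_component(memberships, ic50)
```

Comparing the values grouped under “12” with those under “1” and “2” is the kind of contrast that can suggest candidate drug combinations.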
Similar studies were conducted for drug MS.275 with respect to components 1, 2, and 12 and for drug GW843682X with respect to components 2, 3, and 23. Results are shown in left and right columns of
Any number of cell lines can be examined against drugs and drug combinations with the present techniques. Examples include cancer cells, including but not limited to thyroid cancer cells, soft tissue cancer cells, skin cancer cells, pancreatic cancer cells, nervous system cancer cells, lung cancer cells, kidney cancer cells, digestive system cancer cells, breast cancer cells, bone cancer cells, blood cancer cells, digestive tract cancer cells, and urogenital cancer cells.
In some of the examples, at least some of the examined cell lines correspond to a primary cancer, and an overlapping cell line cluster corresponds to the primary cancer. In yet other examples, at least some of the cell lines correspond to a primary cancer, and an overlapping cell line cluster includes a secondary cancer different from the primary cancer.
The present techniques can automatically identify, from a database of drugs, a plurality of alternative drugs exhibiting a drug responsiveness value between a high drug responsiveness value and a low drug responsiveness value, such that the alternative drugs have the highest drug responsiveness for the overlapping cell line cluster. This can be used to maximize the likelihood of success of the alternative drugs. In automated drug delivery systems, the alternative drug may then be administered to the overlapping cell line cluster for testing.
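A selection of this kind might be sketched as a window filter followed by a ranking. The example below is an assumption-laden illustration, not the disclosed procedure: it assumes each drug has a single responsiveness score for the overlapping cell line cluster in which a larger score means a stronger response (for instance, a negated IC50-based value), and the function name, thresholds, and scores are all hypothetical.

```python
def select_alternative_drugs(responsiveness, low, high, top=3):
    """Return up to `top` drugs whose score lies in [low, high], best first.

    `responsiveness` maps a drug name to a score for the overlapping
    cluster; a larger score is assumed to mean a stronger response.
    """
    in_window = {d: r for d, r in responsiveness.items() if low <= r <= high}
    return sorted(in_window, key=in_window.get, reverse=True)[:top]

# Toy scores for four drugs on one overlapping cell line cluster.
scores = {"drug_a": 0.9, "drug_b": 0.4, "drug_c": 0.7, "drug_d": 0.1}
candidates = select_alternative_drugs(scores, low=0.3, high=0.8, top=2)
```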
We note that an alternative drug may comprise at least one drug combination comprising a treatment drug and an additive drug. In other examples, the alternative drug comprises at least one drug combination that does not contain the treatment drug.
At block 1308, an optimization algorithm process is performed on data for a cluster of drugs, as an implementation of Algorithm 1. From the optimization, one or more alternative drugs are identified (block 1310), namely drugs exhibiting a desired drug responsiveness, measured as an IC50 value. Once identified, the one or more alternative drugs may then be applied to cell lines for treatment (block 1312) and for collecting new drug responsiveness data.
The program memory 106 and/or the RAM 110 may store various applications (i.e., machine readable instructions) for execution by the processor 108. For example, an operating system 130 may generally control the operation of the signal-processing device 102 and provide a user interface to the signal-processing device 102 to implement data processing operations. The program memory 106 and/or the RAM 110 may also store a variety of subroutines 132 for accessing specific functions of the signal-processing device 102. By way of example, and without limitation, the subroutines 132 may include, among other things: a subroutine to perform the process steps of
The subroutines 132 may also include other subroutines, for example, implementing software keyboard functionality, interfacing with other hardware in the signal-processing device 102, etc. The program memory 106 and/or the RAM 110 may further store data related to the configuration and/or operation of the signal-processing device 102, and/or related to the operation of the one or more subroutines 132. For example, the data may be data gathered from the databases 115 and 116, data determined and/or calculated by the processor 108, etc. In addition to the matrix generator 104, the signal-processing device 102 may include other hardware resources. The signal-processing device 102 may also include various types of input/output hardware such as a visual display 126 and input device(s) 128 (e.g., keypad, keyboard, etc.). In an embodiment, the display 126 is touch-sensitive, and may cooperate with a software keyboard routine as one of the subroutines 132 to accept user input.
It may be advantageous for the signal-processing device 102 to communicate with a medical treatment device, medical drug testing databases, biomarker/genomic testing databases, or a medical data records storage device, through the network 117 or through any of a number of known networking devices and techniques (e.g., through a computer network such as a hospital or clinic intranet, the Internet, etc.). For example, the signal-processing device may be connected to a medical records database, hospital management processing system, healthcare professional terminals (e.g., doctor stations, nurse stations), high throughput screening framework, drug/biomarker databases, or other system.
The system 100 may be implemented as computer-readable instructions stored on a single dedicated machine, for example, one with one or more computer processing units. In some examples, the dedicated machine performs only the functions described in the processes of
In some examples, one or more of the functions of the system 100 may be performed remotely, including, for example, on a server connected to a remote computing device, through a wired or wireless interface at 112 and the network 117. Such distributed processing may include having all or a portion of the processing of system 100 performed on a remote server. In some embodiments, the techniques herein may be implemented as software-as-a-service (SaaS) with the computer-readable instructions to perform the method steps being stored on one or more of the computer processing devices and communicating with one or more user devices, including but not limited to personal computers, handheld devices, etc.
Provided are computer-implemented processes built upon new models for identifying therapeutic biomarkers for cancer and other pathologies, where these techniques can answer specific questions regarding sensitivity, resistance and synergy across large sets of possible treatment drugs. In an example, a penalized likelihood approach for an FMMR model was used, which enforced sparsity in the genomic features. To enable overlapping clustering, the FMMR model was then generalized, and a new EM algorithm was derived and implemented for estimation of the model parameters. While the noteworthy plaid model can also be generalized for overlapping clustering of multivariate regression data, the generalized FMMR model markedly outperforms that method.
Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.
Additionally, certain embodiments are described herein as including logic or a number of routines, subroutines, applications, or instructions. These may constitute either software (e.g., code embodied on a machine-readable medium or in a transmission signal) or hardware. In hardware, the routines, etc., are tangible units capable of performing certain operations and may be configured or arranged in a certain manner. In example embodiments, one or more computer systems (e.g., a standalone, client or server computer system) or one or more hardware modules of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.
In various embodiments, a hardware module may be implemented mechanically or electronically. For example, a hardware module may comprise dedicated circuitry or logic that is permanently configured (e.g., as a special-purpose processor, such as a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) to perform certain operations. A hardware module may also comprise programmable logic or circuitry (e.g., as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured by software to perform certain operations. It will be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.
Accordingly, the term “hardware module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein. Considering embodiments in which hardware modules are temporarily configured (e.g., programmed), each of the hardware modules need not be configured or instantiated at any one instance in time. For example, where the hardware modules comprise a general-purpose processor configured using software, the general-purpose processor may be configured as respective different hardware modules at different times. Software may accordingly configure a processor, for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.
Hardware modules can provide information to, and receive information from, other hardware modules. Accordingly, the described hardware modules may be regarded as being communicatively coupled. Where multiple of such hardware modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) that connects the hardware modules. In embodiments in which multiple hardware modules are configured or instantiated at different times, communications between such hardware modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware modules have access. For example, one hardware module may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware module may then, at a later time, access the memory device to retrieve and process the stored output. Hardware modules may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).
The various operations of the example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented modules that operate to perform one or more operations or functions. The modules referred to herein may, in some example embodiments, comprise processor-implemented modules.
Similarly, the methods or routines described herein may be at least partially processor-implemented. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented hardware modules. The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but also deployed across a number of machines. In some example embodiments, the processor or processors may be located in a single location (e.g., within a home environment, an office environment or as a server farm), while in other embodiments the processors may be distributed across a number of locations.
The one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., application program interfaces (APIs)).
The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but also deployed across a number of machines. In some example embodiments, the one or more processors or processor-implemented modules may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the one or more processors or processor-implemented modules may be distributed across a number of geographic locations.
Unless specifically stated otherwise, discussions herein using words such as “processing,” “computing,” “calculating,” “determining,” “presenting,” “displaying,” or the like may refer to actions or processes of a machine (e.g., a computer) that manipulates or transforms data represented as physical (e.g., electronic, magnetic, or optical) quantities within one or more memories (e.g., volatile memory, non-volatile memory, or a combination thereof), registers, or other machine components that receive, store, transmit, or display information.
As used herein any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
Some embodiments may be described using the expression “coupled” and “connected” along with their derivatives. For example, some embodiments may be described using the term “coupled” to indicate that two or more elements are in direct physical or electrical contact. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. The embodiments are not limited in this context.
As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
In addition, “a” or “an” is employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the description. This description, and the claims that follow, should be read to include one or at least one, and the singular also includes the plural unless it is obvious that it is meant otherwise.
While the present invention has been described with reference to specific examples, which are intended to be illustrative only and not to be limiting of the invention, it will be apparent to those of ordinary skill in the art that changes, additions and/or deletions may be made to the disclosed embodiments without departing from the spirit and scope of the invention.
The foregoing description is given for clearness of understanding; and no unnecessary limitations should be understood therefrom, as modifications within the scope of the invention may be apparent to those having ordinary skill in the art.
This application claims priority to U.S. Provisional Application No. 62/368,997, entitled Precision Therapeutic Biomarker Screening for Cancer, and filed Jul. 29, 2016, which is incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2017/044712 | 7/31/2017 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2018/023120 | 2/1/2018 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
20080118576 | Theodorescu et al. | May 2008 | A1 |
20100042563 | Livingston | Feb 2010 | A1 |
20100144554 | Semizarov et al. | Jun 2010 | A1 |
20100198900 | Gifford | Aug 2010 | A1 |
20150072878 | Buechler | Mar 2015 | A1 |
20160102365 | Ince | Apr 2016 | A1 |
Entry |
---|
Monga, Manish, and Edward A. Sausville. “Developmental therapeutics program at the NCI: molecular target and drug discovery process.” Leukemia 16.4 (2002): 520-526. (Year: 2002). |
Akaike, A new look at the statistical model identification, IEEE transactions on automatic control, 19(6):716-723 (1974). |
Anderson, Avoiding pitfalls when using information-theoretic methods, The Journal of Wildlife Management, 912-918 (2002). |
Azzalini, The skew-normal distribution and related multivariate families, Scandinavian Journal of Statistics, 32(2):159-188 (2005). |
Bailey et al., Implementation of biomarker-driven cancer therapy: existing tools and remaining gaps, Discovery medicine, 17(92):101-114 (2014). |
Banerjee et al., Model-based overlapping clustering, Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, ACM, 532-537 (2005). |
Barretina et al., The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity, Nature, 483(7391):603-607 (2012). |
Basu et al., An Interactive Resource to Identify Cancer Genetic and Lineage Dependencies Targeted by Small Molecules, Cell, 154(5):1151-1161 (2013). |
Branco et al., A general class of multivariate skew-elliptical distributions, Journal of Multivariate Analysis, 79(1):99-113 (2001). |
Breiman et al., Predicting multivariate responses in multiple linear regression, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3-54 (1997). |
Chen et al., Regularized multivariate regression models with skew-t error distributions, Journal of Statistical Planning and Inference, 149:125-139 (2014). |
Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, Journal of the royal statistical society: Series B (methodological), 39(1):1-38 (1977). |
Desarbo et al., A maximum likelihood methodology for clusterwise linear regression, Journal of classification, 5(2):249-282 (1988). |
Druker et al., Five-year follow-up of patients receiving imatinib for chronic myeloid leukemia, New England Journal of Medicine, 355(23):2408-2417 (2006). |
Estivill-Castro, Why so many clustering algorithms: a position paper, ACM SIGKDD explorations newsletter, 4(1):65-75 (2002). |
Fan et al., A selective overview of variable selection in high dimensional feature space, Statistica Sinica, 20(1):101-148 (2010). |
Fan et al., Statistical challenges with high dimensionality: Feature selection in knowledge discovery, arXiv preprint math/0602133, (2006). |
Fraley et al., How many clusters? Which clustering method? Answers via model-based cluster analysis, The computer journal, 41(8):578-588 (1998). |
Frey et al., Clustering by passing messages between data points, science, 315(5814):972-976 (2007). |
Fu et al., Multiplicative mixture models for overlapping clustering, in 2008 Eighth IEEE International Conference on Data Mining, IEEE, 791-796 (2008). |
Galimberti et al., A multivariate linear regression analysis using finite mixtures of t distributions, Computational Statistics & Data Analysis, 71:138-150 (2014). |
Garnett et al., Systematic identification of genomic markers of drug sensitivity in cancer cells, Nature, 483(7391):570-575 (2012). |
Grun et al., FlexMix version 2: finite mixtures with concomitant variables and varying and constant parameters, (2008). |
Heller et al., A Nonparametric Bayesian Approach to Modeling Overlapping Clusters, in AISTATS, 187-194 (2007). |
International Application No. PCT/US17/44712, International Preliminary Report on Patentability, dated Feb. 7, 2019. |
International Application No. PCT/US17/44712, International Search Report and Written Opinion, dated Oct. 6, 2017. |
Ishwaran et al., Random survival forests, Ann. Appl. Stat., 2(3):841-860 (2008). |
Jain et al., Data clustering: a review, ACM computing surveys (CSUR), 31(3):264-323 (1999). |
Jedidi et al., On estimating finite mixtures of multivariate regression and simultaneous equation models, Structural Equation Modeling: A Multidisciplinary Journal, 3(3):266-289 (1996). |
Jiang et al., The E-MS algorithm: model selection with incomplete data, Journal of the American Statistical Association, 110(511):1136-1147 (2015). |
Keribin, Consistent estimation of the order of mixture models, Sankhyā: The Indian Journal of Statistics, Series A, 62:49-66 (2000). |
Khalili et al., Regularization in finite mixture of regression models with diverging number of parameters, Biometrics, 69(2):436-446 (2013). |
Khalili et al., Variable selection in finite mixture of regression models, Journal of the American Statistical Association 102(479):1025-1038 (2012). |
Lazzeroni et al., Plaid models for gene expression data, Statistica sinica, 12(1):61-86 (2002). |
Leisch, Flexmix: A general framework for finite mixture models and latent class regression in R, (2004). |
Liu et al., Precision therapeutic biomarker identification with application to the cancer genome project, arXiv preprint arXiv:1702.02264, (2017). |
Martins et al., Linking Tumor Mutations to Drug Responses via a Quantitative Chemical-Genetic Interaction Map, Cancer discovery, 5(2):154-167 (2015). |
Merimsky et al., Gemcitabine in soft tissue or bone sarcoma resistant to standard chemotherapy: a phase II study, Cancer chemotherapy and pharmacology, 45(2):177-181 (2000). |
Peng et al., Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer, Ann. Appl. Stat., 4(1):53-77 (2010). |
Rothman et al., Sparse multivariate regression with covariance estimation, Journal of Computational and Graphical Statistics, 19(4):947-962 (2010). |
Schwarz, Estimating the dimension of a model, The annals of statistics, 6(2):461-464 (1978). |
Segal et al., Decomposing gene expression into cellular processes, Biocomputing, 8:89-100 (2002). |
Simila et al., Input selection and shrinkage in multiresponse linear regression, Computational Statistics & Data Analysis, 52(1):406-422 (2007). |
Soffritti et al., Multivariate linear regression with non-normal errors: a solution based on mixture models, Statistics and Computing, 21(4):523-536 (2011). |
Stekhoven et al., MissForest: non-parametric missing value imputation for mixed-type data, Bioinformatics, 28(1):112-118 (2012). |
Stone, An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion, Journal of the Royal Statistical Society: Series B (Methodological), 39(1):44-47 (1977). |
Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288 (1996). |
Turlach et al., Simultaneous variable selection, Technometrics, 47(3):349-363 (2005). |
Turner et al., Biclustering models for structured microarray data, IEEE/ ACM Transactions on Computational Biology and Bioinformatics (TCBB), 2(4):316-329 (2005). |
Turner et al., Improved biclustering of microarray data demonstrated through systematic performance tests, Computational statistics & data analysis, 48(2):235-254 (2005). |
Wildey et al., Pharmacogenomic approach to identify drug sensitivity in small-cell lung cancer, PloS one, 9(9):e106784 (2014). |
Zhang, A Bayesian model for biclustering with applications, Journal of the Royal Statistical Society: Series C (Applied Statistics), 59(4):635-656 (2010). |
Zou et al., Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320 (2005). |
Number | Date | Country | |
---|---|---|---|
20190161784 A1 | May 2019 | US |
Number | Date | Country | |
---|---|---|---|
62368997 | Jul 2016 | US |