Embodiments of the disclosure relate to subtyping subjects according to phenotypic information, particularly in the case where the phenotypic information is multidimensional.
It is desirable to classify subjects into phenotypic groups to improve treatment and/or risk management. Detecting phenotypic subgroups of patients suffering from complex diseases such as Parkinson's disease (PD) and chronic obstructive pulmonary disease (COPD), for example, can allow stratified risk assessment. Furthermore, it can provide support for early detection of deteriorating patients, determination of individualized and customized treatment, and prevention strategies for different phenotypic groups, which ultimately results in enhanced treatment outcomes. There would also be significant value in understanding patient phenotypes for improving treatments, conducting clinical trials, etc.
It is an object of the invention to provide improved methods and apparatus for identifying phenotypic groups of subjects.
According to an aspect of the invention, there is provided a computer-implemented method of subtyping subjects based on phenotypic information, comprising: receiving a subject data unit for each of a plurality of subjects, each subject data unit representing a plurality of different phenotypic information items about the subject of the subject data unit; using a deep learning algorithm to derive a lower dimensional representation of each subject data unit and a clustering algorithm to detect clusters of the resulting lower dimensional representations, each cluster representing a subtype of subjects that are phenotypically related to each other, wherein: the deep learning algorithm and clustering algorithm are implemented by a single mathematical model in which the derivation of the lower dimensional representations and the detection of the clusters are performed jointly.
Thus, a method is provided in which a deep learning algorithm and clustering algorithm are implemented in a joint framework. This allows the process of determining representations of high-dimensional features in the input data (the subject data units) to inform the clustering process and vice versa, which the inventors have found significantly improves performance relative to alternative approaches in which clustering is performed without dimension reduction or where dimension reduction and clustering are performed completely separately. The improved performance allows subjects to be clustered into groups more meaningfully and efficiently, thereby enabling management of subjects (e.g. risk management, treatment plan selection, etc.) to be performed more reliably and/or more efficiently.
In an embodiment, the joint performance of the derivation of the lower dimensional representations and the detection of the clusters comprises optimizing a unified loss function having a term corresponding to the derivation of the lower dimensional representations and a term corresponding to the detection of the clusters, optionally with a regularization term. The inventors have found that performing the joint optimization based on a unified loss function can be implemented particularly efficiently.
According to an alternative aspect, there is provided an apparatus for subtyping subjects based on phenotypic information, comprising: a data receiving unit configured to receive a subject data unit for each of a plurality of subjects, each subject data unit representing a plurality of different phenotypic information items about the subject of the subject data unit; and a data processing unit configured to: use a deep learning algorithm to derive a lower dimensional representation of each subject data unit and a clustering algorithm to detect clusters of the resulting lower dimensional representations, each cluster representing a subtype of subjects that are phenotypically related to each other, wherein: the deep learning algorithm and clustering algorithm are implemented by a single mathematical model in which the derivation of the lower dimensional representations and the detection of the clusters are performed jointly.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which corresponding reference symbols indicate corresponding parts, and in which:
Methods of the present disclosure are computer-implemented. Each step of the disclosed methods may therefore be performed by a computer. The computer may comprise various combinations of computer hardware, including for example CPUs, RAM, SSDs, motherboards, network connections, firmware, software, and/or other elements known in the art that allow the computer hardware to perform the required computing operations. The required computing operations may be defined by one or more computer programs. The one or more computer programs may be provided in the form of media, optionally non-transitory media, storing computer readable instructions. When the computer readable instructions are read by the computer, the computer performs the required method steps. The computer may consist of a self-contained unit, such as a general-purpose desktop computer, laptop, tablet, mobile telephone, smart device (e.g. smart TV), etc. Alternatively, the computer may consist of a distributed computing system having plural different computers connected to each other via a network such as the internet or an intranet.
As explained in the introductory part of the description, it is desirable to subtype (which may also be referred to as group, cluster or classify) subjects into phenotypic subtypes (groups, clusters or classes) to improve treatment and/or risk management. The following detailed description provides example approaches for achieving this in an efficient way. The methods disclosed can be provided as part of a pipeline involving data curation and pre-processing (cleaning, imputation, and feature selection), as well as the clustering methods described specifically below with reference to the figures. The clustering methods disclosed can be used to allow accurate identification of phenotypic subtypes in patient cohorts for complex disease, which can be used for example to stratify patients with complex diseases into subtypes with differing disease progression and risk of disease complications. The sub-stratification of the diseases makes it possible to more efficiently screen risk factors (genetic or/and environmental) and/or tailor and target early treatment to patients, thereby enabling a route towards precision medicine and associated improvements in healthcare delivery and patient outcomes.
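As one illustration of the pre-processing stages mentioned above (cleaning, imputation, and feature selection), the sketch below applies median imputation, removal of near-constant features, and z-score standardisation with NumPy. The specific imputation and feature-selection choices are assumptions made for this example rather than requirements of the disclosure.

```python
import numpy as np

def preprocess(X, var_threshold=1e-6):
    """Illustrative pre-processing: median imputation of missing values,
    removal of near-constant features, and z-score standardisation."""
    X = X.astype(float).copy()
    # Impute missing entries (NaN) with the per-feature median.
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    # Simple feature selection: drop near-constant features.
    keep = X.var(axis=0) > var_threshold
    X = X[:, keep]
    # Standardise each remaining feature to zero mean, unit variance.
    return (X - X.mean(axis=0)) / X.std(axis=0), keep

# Example: 4 subjects x 3 phenotypic items, with one missing value
# and one constant (uninformative) item.
X_raw = np.array([[1.0, 5.0, 2.0],
                  [2.0, 5.0, np.nan],
                  [3.0, 5.0, 6.0],
                  [4.0, 5.0, 8.0]])
X_clean, kept = preprocess(X_raw)
```

After this stage the constant second item has been removed and each retained item is on a comparable scale, which is a common precondition for the clustering methods described below.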
In an embodiment, the method comprises a step S1 of receiving a subject data unit 20 for each of a plurality of subjects. Thus, a set comprising a plurality of subject data units 20 is received, as depicted schematically in the top left of
In step S2A, a deep learning algorithm 23 is used to derive a lower dimensional representation of each subject data unit 20 (i.e. having lower dimensions than the original subject data unit 20). In step S2B, a clustering algorithm 24 is used to detect clusters 25-27 (see
Exemplary configurations for the single mathematical model 22 are now described in further detail with reference to
In an embodiment, the mathematical model 22 is configured so that the clustering algorithm 24 provides supervisory signals to the deep learning algorithm 23. In a particular example described below, the deep learning algorithm 23 is an autoencoder (AE) deep representation learning algorithm and the clustering algorithm 24 is an unsupervised Gaussian Mixture Model (GMM) clustering model.
To achieve the above goal, the NN may be trained with a loss function Ld(X, X̂):

Ld(X, X̂) = (1/N) Σ_{i=1}^{N} L(x_i, x̂_i)

where L(x_i, x̂_i) is the loss function that characterizes the reconstruction error caused by the deep AE in the compression network. The loss function may comprise the root mean square error or another error metric. It is desirable to achieve the lowest reconstruction error possible to ensure the low-dimensional representation contains as much of the information present in the high-dimensional data as possible.
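A minimal sketch of training a compression network against such a reconstruction loss is shown below, using a single linear encoder/decoder pair in place of the deep AE. The linear architecture, layer sizes, and learning rate are assumed simplifications for illustration, not the disclosure's network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the compression network: a linear encoder/decoder
# trained by gradient descent to minimise
#   Ld(X, X_hat) = (1/N) * sum_i ||x_i - x_hat_i||^2.
N, d, k = 200, 23, 3                 # subjects, input dims, latent dims
X = rng.normal(size=(N, d))
W_enc = rng.normal(scale=0.1, size=(d, k))
W_dec = rng.normal(scale=0.1, size=(k, d))

lr, losses = 0.01, []
for _ in range(300):
    Z = X @ W_enc                    # lower dimensional representation
    X_hat = Z @ W_dec                # reconstruction
    E = X_hat - X
    losses.append((E ** 2).sum() / N)
    # Gradients of Ld with respect to the two weight matrices.
    W_dec -= lr * (2 / N) * Z.T @ E
    W_enc -= lr * (2 / N) * X.T @ (E @ W_dec.T)
```

The falling reconstruction loss indicates that the 3-dimensional latent feature Z retains progressively more of the information in the higher dimensional input.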
After the dimensional reduction by the deep learning algorithm 23, the latent feature Z is fed (arrow 28) to the clustering algorithm 24. In an embodiment, the clustering algorithm 24 is parametric model-based (e.g. GMM) or nonparametric (such as hierarchical clustering). GMM is used as an exemplary clustering algorithm 24 for the following description. It is understood that the GMM could be replaced by a different clustering algorithm 24.
In the GMM setting, we assume the investigated heterogeneous sample Z has a finite mixture of multivariate normal densities:

f(z) = Σ_{k=1}^{K} π_k φ(z; θ_k)

where φ(z; θ_k) is the multivariate Gaussian density with θ_k = (μ_k, Σ_k), K is the number of clustering components, π_k is the proportion of the kth component, and μ_k, Σ_k are the mean and covariance of the data belonging to the kth component.
To learn the parameters π_k, μ_k and Σ_k, a well-established algorithm, the Expectation-Maximization (EM) algorithm, can be applied to update the parameters. As the name indicates, there are two steps in this algorithm: the expectation step and the maximization step. In the expectation step, the responsibilities γ̂ = softmax(p), i.e. the cluster membership matrix whose entry γ̂_{ik} gives the probability that the ith data point belongs to the kth cluster, can be computed:

γ̂_{ik} = π_k φ(z_i; θ_k) / Σ_{j=1}^{K} π_j φ(z_i; θ_j)

In the maximization step, the parameters π_k, μ_k, Σ_k are updated as:

N_k = Σ_{i=1}^{N} γ̂_{ik},  π_k = N_k / N,  μ_k = (1/N_k) Σ_{i=1}^{N} γ̂_{ik} z_i,  Σ_k = (1/N_k) Σ_{i=1}^{N} γ̂_{ik} (z_i − μ_k)(z_i − μ_k)^T
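The expectation and maximization steps can be sketched directly as code. The toy latent data, the initialisation, and the small covariance regulariser added for numerical stability are assumptions of this example, not part of the described method.

```python
import numpy as np

def gaussian_pdf(Z, mu, Sigma):
    """Multivariate normal density phi(z; mu, Sigma), evaluated per row."""
    d = Z.shape[1]
    diff = Z - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def em_step(Z, pi, mus, Sigmas):
    """One EM iteration for a K-component GMM on the latent features Z."""
    N, K = Z.shape[0], len(pi)
    # E-step: responsibilities gamma[i, k] = P(cluster k | z_i).
    dens = np.column_stack([pi[k] * gaussian_pdf(Z, mus[k], Sigmas[k])
                            for k in range(K)])
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update pi_k, mu_k, Sigma_k from the responsibilities.
    Nk = gamma.sum(axis=0)
    pi = Nk / N
    mus = [(gamma[:, k] @ Z) / Nk[k] for k in range(K)]
    Sigmas = [((gamma[:, k] * (Z - mus[k]).T) @ (Z - mus[k])) / Nk[k]
              + 1e-6 * np.eye(Z.shape[1])       # stability regulariser
              for k in range(K)]
    return pi, mus, Sigmas, gamma

rng = np.random.default_rng(1)
# Two well-separated toy clusters in a 3-D latent space.
Z = np.vstack([rng.normal(-3.0, 1.0, size=(50, 3)),
               rng.normal(+3.0, 1.0, size=(50, 3))])
pi, mus = np.array([0.5, 0.5]), [Z[0].copy(), Z[-1].copy()]
Sigmas = [np.eye(3), np.eye(3)]
for _ in range(20):
    pi, mus, Sigmas, gamma = em_step(Z, pi, mus, Sigmas)
```

After a few iterations the responsibilities assign each latent point almost entirely to the component of its generating cluster.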
The optimal parameters can then be obtained through the minimization of the negative log-likelihood of the model:

Lc(Z, θc) = −Σ_{i=1}^{N} log( Σ_{k=1}^{K} π_k φ(z_i; θ_k) )
The proposed joint framework combines the abovementioned deep representation learning and the clustering into a single model with a unified loss function:
LU(θd, θc) = λd Ld(X, X̂) + λc Lc(Z, θc) + λr Lr
where Ld(X, X̂) is the loss function of the dimensionality reduction, Lc(Z, θc) the loss function for the clustering, Lr the regularization term, and λd, λc, λr are hyperparameters that weight the contributions of the respective terms. Thus, the joint performance of the derivation of the lower dimensional representations and the detection of the clusters may comprise optimizing a unified loss function having a term corresponding to the derivation of the lower dimensional representations (Ld) and a term corresponding to the detection of the clusters (Lc), optionally with a regularization term (Lr).
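A sketch of evaluating such a unified loss for given model outputs follows. The mean squared reconstruction error for Ld, the choice of an L2 weight penalty for Lr, and the λ values are illustrative assumptions; the disclosure does not fix these forms.

```python
import numpy as np

def reconstruction_loss(X, X_hat):
    # Ld: mean squared reconstruction error over subjects.
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

def gmm_nll(Z, pi, mus, Sigmas):
    # Lc: negative log-likelihood of the latent features under the GMM.
    N, d = Z.shape
    total = np.zeros(N)
    for pi_k, mu, Sigma in zip(pi, mus, Sigmas):
        diff = Z - mu
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
        total += pi_k * np.exp(-0.5 * quad) / np.sqrt(
            (2 * np.pi) ** d * np.linalg.det(Sigma))
    return -np.sum(np.log(total))

def unified_loss(X, X_hat, Z, pi, mus, Sigmas, weights,
                 lam_d=1.0, lam_c=0.1, lam_r=1e-4):
    # LU = lam_d*Ld + lam_c*Lc + lam_r*Lr (Lr here: L2 weight penalty).
    L_r = sum(np.sum(W ** 2) for W in weights)
    return (lam_d * reconstruction_loss(X, X_hat)
            + lam_c * gmm_nll(Z, pi, mus, Sigmas)
            + lam_r * L_r)

# Toy evaluation with placeholder model outputs.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
X_hat = np.zeros_like(X)
Z = np.array([[0.1], [0.2]])
L_U = unified_loss(X, X_hat, Z, [1.0], [np.zeros(1)], [np.eye(1)],
                   weights=[np.ones((2, 1))])
```

In a full implementation the gradients of this scalar would drive updates of both the network weights (θd) and the cluster parameters (θc).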
By optimizing the unified loss function with a number of iterations of training of the deep learning algorithm 23 as well as the clustering algorithm 24, it is possible to obtain not only more powerful feature representations, but also precise assignment of data into corresponding clusters.
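One simple way such joint iterations could be scheduled is to alternate a gradient step on the reconstruction term with a refresh of the cluster parameters, so each informs the other. The linear encoder/decoder and the hard-assignment centre update below are assumed simplifications of the deep AE and GMM; the disclosure does not prescribe this exact schedule.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic subject data with two latent phenotypic groups.
N, d, k_lat, K = 120, 10, 2, 2
X = np.vstack([rng.normal(-1.0, 0.3, size=(N // 2, d)),
               rng.normal(+1.0, 0.3, size=(N // 2, d))])
W_enc = rng.normal(scale=0.1, size=(d, k_lat))
W_dec = rng.normal(scale=0.1, size=(k_lat, d))
centres = rng.normal(size=(K, k_lat))

lr, losses = 0.01, []
for _ in range(200):
    # Representation step: one gradient step on the reconstruction term.
    Z = X @ W_enc
    E = Z @ W_dec - X
    losses.append((E ** 2).sum() / N)
    W_dec -= lr * (2 / N) * Z.T @ E
    W_enc -= lr * (2 / N) * X.T @ (E @ W_dec.T)
    # Clustering step: refresh cluster centres in the current latent space
    # (hard assignment, a simplified stand-in for the GMM M-step).
    Z = X @ W_enc
    labels = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    for j in range(K):
        if (labels == j).any():
            centres[j] = Z[labels == j].mean(axis=0)
```

Because the cluster centres are recomputed in the latest latent space on every iteration, the clustering continually tracks the evolving representation, which is the essence of the joint scheme.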
In step S3, a further subject data unit is obtained. The further subject data unit comprises a plurality of different phenotypic information items about a subject to be assessed. The further subject data unit may take any of the forms described above for the other subject data units. The single mathematical model 22 is used to derive a lower dimensional representation of the subject data unit and assign the lower dimensional representation of the subject data unit to one of the detected clusters 25-27, thereby identifying to which of the clusters the subject to be assessed belongs. Thus, steps S1-S2 effectively train the method by generating clusters of subject data units from reference subjects. A subject data unit from a new subject can then be processed to determine which of the clusters the new subject belongs to, thereby subtyping the new subject.
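Step S3 can be sketched as encoding the further subject data unit and selecting the cluster with the highest posterior responsibility. The encoder weights and cluster parameters below are hypothetical placeholders standing in for a trained model.

```python
import numpy as np

def assign_cluster(x_new, W_enc, pi, mus, Sigmas):
    """Encode a further subject data unit and return the index of the
    cluster with the highest GMM posterior responsibility."""
    z = x_new @ W_enc                      # lower dimensional representation
    d = z.shape[0]
    scores = []
    for pi_k, mu, Sigma in zip(pi, mus, Sigmas):
        diff = z - mu
        quad = diff @ np.linalg.inv(Sigma) @ diff
        dens = np.exp(-0.5 * quad) / np.sqrt(
            (2 * np.pi) ** d * np.linalg.det(Sigma))
        scores.append(pi_k * dens)         # proportional to P(cluster | z)
    return int(np.argmax(scores))

# Hypothetical trained model: identity encoder, two clusters at -2 and +2.
W_enc = np.eye(3)
pi = [0.5, 0.5]
mus = [np.full(3, -2.0), np.full(3, 2.0)]
Sigmas = [np.eye(3), np.eye(3)]
cluster = assign_cluster(np.array([1.8, 2.2, 1.9]), W_enc, pi, mus, Sigmas)
# cluster == 1: the new subject lies nearest the second cluster centre.
```

The returned index identifies the detected subtype to which the subject to be assessed belongs.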
Aspects of the above-described methods may be implemented by an apparatus 5 such as that depicted in
An exemplary application of a method of an embodiment to identify subtypes of Parkinson's Disease (PD) is now described. PD is a typical complex and heterogeneous disease. In this example, the deep learning algorithm 23 is an autoencoder (AE) and the clustering algorithm 24 is an unsupervised Gaussian Mixture Model (GMM) clustering model. The phenotypic information items 21 comprise 23 laboratory test items in this example (mainly blood biomarkers, but other information such as neuroimaging, genetic, clinical, medical imaging and demographic data could be used in extensions of the example), such that each subject data unit 20 has 23 dimensions. The laboratory test items correspond to the first laboratory assessment of the patient and are commonly prescribed as an initial health assessment indicator in this area. The AE deep learning algorithm 23 was used to extract abstract representations of the 23-dimensional variables by transforming the 23D variables into 3D, which were then fed to the GMM clustering algorithm 24 to update the clusters.
With further analysis of the clusters identified by the method (representing subtypes of the complex disease in this example), the inventors found that each subtype represents a different stage of the disease progression, and the subpopulation of each subtype features similar clinical manifestations. All these findings could provide guidance for treatment decisions for a given individual. If a subtype is found to have a causal and clinically justified association with an underlying mechanism, it can serve as an automated mechanism for understanding the aetiology of the disease.
Priority application: 1807308.0, May 2018, GB (national)
International filing: PCT/GB2019/050682, filed 3/12/2019 (WO)