Various embodiments of the present disclosure regard solutions for estimating a variable of interest associated to a given disease as a function of a plurality of different omics data of a patient.
It is deemed that gene profiles can be linked to the risk of developing a given disease and/or to the prognosis of evolution of the disease, such as a particular type of cancer or neoplasm. For instance, the prognosis is frequently quantified through a quantity that indicates the time of disease-free survival (DFS) after a particular treatment.
With the advent and recent reduction in cost of next-generation sequencing (NGS) technology, a large amount of omics data has become progressively available for biocomputational analyses. In particular, NGS technology is an in vitro analysis process that comprises massively parallel sequencing and enables sequencing of large genomes in a very short time. The term "omics" refers to data that identify genomics, transcriptomics, proteomics, metabolomics, and/or metagenomics.
This has in turn increased the interest in the development of automatic learning, i.e., machine learning (ML), models that are able to decode the correlations between different gene profiles (also referred to as "omics profiles"), for example with reference to gene mutation (differences between individuals encoded in different DNA sequences), expression (differences between individuals and tissues deriving from the effective process used for producing proteins, encoded in the RNA), and copy numbers (differences between individuals and tissues deriving from different copy numbers of a given gene). For instance, in this context, European patents EP 1 977 237 B1, EP 2 392 678 B1, EP 2 836 837 B1, EP 3 237 638 B1, or EP 2 700 038 B1 may be cited.
For instance, to estimate the risk of developing a given disease or the disease-free survival time of a patient, the machine-learning model may comprise a parameterized mathematical function, such as an artificial neural network, configured for estimating a quantity of interest that corresponds, respectively, to the risk of developing the disease or the disease-free survival time, as a function of the omics data obtained for a given patient. In particular, by acquiring a training dataset that comprises the omics data of a plurality of patients and the respective information as to whether the patient has developed the disease, or the respective disease-free survival time of each patient, a training algorithm can modify, typically through an iterative process, the parameters of the mathematical function in such a way as to reduce the difference between the estimations of the quantity of interest and the respective data of the dataset. Consequently, once the learning model has been trained, the mathematical function can provide an estimate of the quantity of interest, i.e., the risk of developing the disease or the disease-free survival time as a function of the respective omics data of a patient.
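A minimal sketch of this iterative training process may look as follows; the linear model, the feature matrix, and all names are illustrative assumptions, not the disclosed embodiment:

```python
import numpy as np

# Toy stand-in for the parameterized mathematical function: a linear model
# f(x) = x @ theta. X holds omics-derived features for 100 hypothetical
# reference patients; y is the quantity of interest (e.g., disease-free
# survival time) taken from the training dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_theta

# Iteratively modify the parameters so as to reduce the difference between
# the estimates f(X) and the values y of the dataset (gradient descent).
theta = np.zeros(5)
lr = 0.1
for _ in range(500):
    grad = X.T @ (X @ theta - y) / len(y)
    theta -= lr * grad
```

Once trained, the same function provides estimates for a new patient as `x_new @ theta`.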
One of the main problems in this field is that the training dataset typically includes a limited number of samples, especially as compared to the enormous number of features that need to be analyzed for an effective prognosis. In fact, clinical trials typically involve from a few hundred up to 1500-2000 patients, whereas NGS data may potentially provide as many features as there are human genes: approximately 20000 in the case of the transcriptome if the analysis is limited to the genes that encode proteins, and two to three times more if the non-coding genes are also considered. From the standpoint of automatic learning, this problem of dimensionality (generally known as the "curse of dimensionality") prevents the models from learning, leading to over-adaptation (overfitting) with respect to the training dataset and to the loss of the capacity for generalization, since there are not enough samples from which to extrapolate a general behaviour.
Various embodiments of the present disclosure hence provide new solutions that are able to handle the aforesaid problem of dimensionality by extracting the relevant information from the data, thus reducing the number of features.
According to one or more embodiments, the above object is achieved through a method having the distinctive elements set forth specifically in the ensuing claims. The embodiments moreover regard a corresponding device, as well as a corresponding computer program product that can be loaded into the memory of at least one computer and comprises portions of software code for executing the steps of the method when the product is run on a computer. As used herein, reference to such a computer program product is understood as being equivalent to reference to a computer-readable means containing instructions for controlling a processing system in order to co-ordinate execution of the method. Reference to “at least one computer” is clearly intended to highlight the possibility of the present description being implemented in a distributed/modular way.
The claims form an integral part of the technical teaching of the description provided herein.
As mentioned previously, various embodiments of the present disclosure relate to solutions for estimating the value of a variable of interest associated to a given disease, such as a neoplasm or cancer, such as a non-small-cell lung cancer, as a function of omics data of a patient. For instance, the variable of interest may indicate whether the respective patient has developed the disease, the seriousness of the disease that the respective patient has developed, or a disease-free survival time of the respective patient.
In various embodiments, during a training step, a computer receives a first dataset of omics data and a second dataset of omics data, where each dataset of omics data comprises the values of a respective plurality of variables that refer to the same genes for each reference patient of a plurality of reference patients. For instance, the datasets of omics data may be chosen from among: a dataset of transcriptomic data, where each variable corresponds to the gene expression of a particular gene, for example expressed in transcripts per million; a dataset of copy number variation data, where each variable corresponds to the copy number variations of a particular gene; and a dataset of mutation data, where each variable corresponds to the mutation data of a particular gene.
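For illustration only, such datasets may be laid out as follows; this is a minimal sketch, and the gene names and values are invented:

```python
import pandas as pd

# Hypothetical toy layout: one row per reference patient, one column per gene,
# with the three omics datasets referring to the same genes.
genes = ["TP53", "EGFR", "KRAS"]
expr = pd.DataFrame([[12.1, 0.4, 5.2],
                     [8.7, 1.1, 4.9]], columns=genes)   # transcripts per million
cnv = pd.DataFrame([[0.1, -0.3, 0.0],
                    [0.2, 0.5, -0.1]], columns=genes)   # copy number variations
mut = pd.DataFrame([[1, 0, 0],
                    [0, 0, 1]], columns=genes)          # binary mutation calls
```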
In various embodiments, the computer then generates a multi-layer network comprising a first layer and a second layer. In particular, for this purpose, the computer associates to each variable of the first dataset of omics data a respective node in the first layer and to each variable of the second dataset of omics data a respective node in the second layer. Next, the computer generates intra-omics connections, calculating for each pair of nodes of the first layer and each pair of nodes of the second layer a respective value of similarity as a function of the data of the respective variables associated to the two nodes, and calculating for each pair of nodes of the first layer and each pair of nodes of the second layer a respective weight associated to the connection between the respective nodes as a function of the value of similarity of the respective pair of nodes. For instance, in various embodiments, in the case where the respective dataset of omics data comprises transcriptomic data or copy number variation data, the computer calculates the respective similarity values via the biweight-midcorrelation metric. Instead, in the case where the respective dataset of omics data comprises mutation data, the computer calculates the respective similarity values via the normalized mutual information. Moreover, the computer generates inter-omics connections, calculating for each pair of nodes between the first layer and the second layer a respective value of similarity as a function of the data of the respective variables associated to the two nodes, and calculating for each pair of nodes between the first layer and the second layer a respective weight associated to the connection between the respective nodes as a function of the value of similarity of the respective pair of nodes. 
For instance, in various embodiments, in the case where the respective datasets of omics data comprise transcriptomic data and copy number variation data, the computer calculates the respective similarity values via the biweight-midcorrelation metric. Instead, in the case where the respective datasets of omics data comprise other data, the computer can calculate the respective similarity values via the point-biserial correlation.
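These similarity computations can be sketched on synthetic data as follows; Pearson correlation stands in here for the biweight midcorrelation (which the disclosure prefers for its robustness to outliers), while the other two metrics are taken directly from SciPy and scikit-learn:

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(1)
expr_i = rng.normal(size=200)                          # continuous, e.g., expression
expr_j = expr_i + rng.normal(scale=0.5, size=200)      # correlated continuous variable
mut_i = rng.integers(0, 2, size=200)                   # binary mutation calls
mut_j = (mut_i ^ (rng.random(200) < 0.1)).astype(int)  # mostly identical mutations

# Intra-omics, continuous vs continuous (stand-in for biweight midcorrelation).
s_cont, _ = pearsonr(expr_i, expr_j)

# Intra-omics, binary vs binary: normalized mutual information.
s_bin = normalized_mutual_info_score(mut_i, mut_j)

# Inter-omics, binary vs continuous: point-biserial correlation.
s_mix, _ = pointbiserialr(mut_i, expr_i)
```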
In various embodiments, the computer then removes non-salient intra-omics connections and inter-omics connections of the multi-layer network by applying to the weights associated to the intra-omics connections between the nodes of the first layer, to the weights associated to the intra-omics connections between the nodes of the second layer, and to the weights associated to the inter-omics connections between the nodes of the first layer and the second layer, a backboning method, such as the NC (Noise-Corrected) method.
In various embodiments, the computer can use also a third dataset of omics data. In this case, the computer can then associate to each variable of the third dataset of omics data a respective node in a third layer and generate intra-omics connections, calculating, for each pair of nodes of the third layer, a respective value of similarity as a function of the data of the respective variables associated to the two nodes, and calculating, for each pair of nodes of the third layer, a respective weight associated to the connection between the respective nodes as a function of the value of similarity of the respective pair of nodes. Moreover, to generate inter-omics connections, the computer can calculate, for each pair of nodes between the first layer and the third layer and each pair of nodes between the second layer and the third layer, a respective value of similarity as a function of the data of the respective variables associated to the two nodes, and calculate, for each pair of nodes between the first layer and the third layer and each pair of nodes between the second layer and the third layer, a respective weight associated to the connection between the respective nodes as a function of the value of similarity of the respective pair of nodes. Likewise, the computer can then remove the non-salient intra-omics connections and inter-omics connections of the multi-layer network by applying also to the weights associated to the intra-omics connections between the nodes of the third layer, to the weights associated to the inter-omics connections between the nodes of the first layer and the nodes of the third layer, and to the weights associated to the inter-omics connections between the nodes of the second layer and the nodes of the third layer, the backboning method.
In various embodiments, the computer then identifies a plurality of communities/sub-networks of the multi-layer network, for example by performing a plurality of executions of the Infomap method. Next, the computer determines, via a feature-extraction method, such as the UMAP (Uniform Manifold Approximation and Projection) method, for each community one or more respective features as a function of the values of the variables associated to the nodes that belong to the respective community, and stores the mapping rules used for generating the one or more features as a function of the values of the variables.
In various embodiments, the computer then generates a training dataset, by obtaining, for each reference patient, a respective value of the variable of interest and calculating, for each reference patient, the respective values of the features associated to the communities as a function of the respective values of the variables of the reference patient using the mapping rules. Finally, the computer uses the training dataset for training a classifier configured for estimating the value of the variable of interest as a function of the values of the features.
Consequently, during an estimation step, the computer receives the values of the variables of the datasets of omics data for a further patient, calculates for the patient the values of the features as a function of the respective values of the variables of the patient using the mapping rules, and estimates, by means of the trained classifier, the value of the variable of interest as a function of the values of the features calculated for the patient.
The embodiments of the present disclosure will now be described with reference to the annexed drawings, which are provided purely by way of non-limiting example and in which:
In the ensuing description, numerous specific details are provided in order to enable an in-depth understanding of the embodiments. The embodiments may be implemented without one or more of the specific details, or with other methods, components, materials, etc. In other cases, well-known operations, materials, or structures are not represented or described in detail so that aspects of the embodiments will not be obscured.
Reference throughout the ensuing description to "an embodiment" or "one embodiment" means that a particular characteristic, distinctive element, or structure described with reference to the embodiment is comprised in at least one embodiment. Thus, the use of phrases such as "in an embodiment" or "in one embodiment" at various points of this description does not necessarily refer to one and the same embodiment. Moreover, the details, characteristics, distinctive elements, or structures may be combined in any way in one or more embodiments.
The references used herein are provided merely for convenience and do not define the scope or meaning of the embodiments.
As explained previously, to enable a more accurate prognosis, omics data have recently also been taken into consideration and processed by means of a machine-learning algorithm. For instance, in various embodiments, the present solution is used for estimating the disease-free survival time for patients affected by non-small-cell lung cancer (NSCLC) who have undergone a surgical operation with complete removal of the neoplasm. However, the solution proposed herein may be used for estimating the disease-free survival time also for other diseases, and in particular for other types of neoplasms. For instance, such an estimate may represent fundamental auxiliary information for oncologists, who can make better-informed decisions, for example varying the frequency of follow-up checks as a function of the estimated time, for instance increasing the frequency of the checks for patients with a short estimated time. In general, the solution described herein may also be used for estimating one or more other quantities of interest that are correlated with the omics data of a patient, such as the risk of a patient developing a given disease and/or the risk of a patient developing a serious form of the disease.
For instance, as illustrated in
In the embodiment considered, after a starting step 1000, the computer 30 trains, in a step 1100, a machine-learning algorithm using a training dataset 200 that comprises omics data for a plurality of reference patients PR. Consequently, in a step 1200, the computer 30 can use the trained algorithm for estimating the quantity of interest, for example the disease-free survival time, of a patient as a function of the respective omics data 300, and the method terminates in an end step 1300.
In particular,
Finally,
For instance, some examples of these data are made available by the Cancer Genome Atlas (TCGA) and can be freely downloaded from many sources, such as https://gdac.broadinstitute.org/ or https://www.cbioportal.org/. For instance, with reference to NSCLC, a dataset 200 can be used that comprises the databases TCGA lung adenocarcinomas (LUADs) and/or TCGA lung squamous-cell carcinomas (LUSCs), which currently represent the main histological subtypes of NSCLC.
In general, the datasets 2001, 2002 and 2003 may refer to different sets of patients PR and/or different genes. For instance, with reference to the NSCLC data of the TCGA database, the datasets 2001, 2002 and 2003 comprise the following data:
Consequently, as illustrated in
Consequently, as also illustrated in
In general, the step 1102 is purely optional, since the computer 30 could directly receive a similar dataset 200.
Consequently, as illustrated in
In this context, the inventors have noted that network analysis can be a very effective approach for analysing the biological complexity inherent in oncogenic processes. Computational methods based upon networks are typically applied to study biological measurements, such as the omics data 200′1, 200′2, and 200′3, separately, even though these are not altogether independent.
Instead, to take the data 200 into consideration jointly, a multi-network approach should be used. In fact, a multidimensional network is a set of nodes that interact with one another in a number of different dimensions or layers, each of which reflects a distinct type of interaction that connects one and the same pair of nodes. Multidimensional networks have emerged recently as a subject of great interest (e.g., BOCCALETTI, Stefano, et al., "The structure and dynamics of multilayer networks", Physics Reports, 2014, 544.1: 1-122; KIVELÄ, Mikko, et al., "Multilayer networks", Journal of Complex Networks, 2014, 2.3: 203-271) and have yielded valuable information in multiple fields, including the integration of multiple omics in the oncological field (e.g., CANTINI, Laura, et al., "Detection of gene communities in multi-networks reveals cancer drivers", Scientific Reports, 2015, 5.1: 1-10; and HIDALGO, Sebastian J. Teran; MA, Shuangge, "Clustering multilayer omics data using MuNCut", BMC Genomics, 2018, 19.1: 1-13).
Consequently, for the dataset 200 considered by way of example, the multidimensional network would have p nodes in three dimensions, where the three dimensions refer, respectively, to the data p′1, p′2, and p′3 that correspond to the same p genes. In general, taking into consideration fewer or more datasets of omics data, the number of layers can vary accordingly. Hence, in the context of omics data, the computer 30 would have to correlate an enormous amount of information, for example deriving from mutations, copy number variations, and data on mRNA gene expression, in a multidimensional network. In fact, as mentioned previously, the number p of the parameters of each network may easily be higher than 5000, 10000, or even 15000 genes.
For this purpose, the computer 30 generates for each of the datasets 200′1, 200′2, and 200′3 a respective weighted network 202, i.e., networks 2021, 2022, and 2023. In this case, as also illustrated in
In particular, for this purpose, the computer generates in step 1104 for each dataset 200′1, 200′2, and 200′3 a respective similarity matrix S. This matrix S is a symmetric square matrix of size p×p, where each element sij contains a measurement of correlation between the gene i and the gene j (i.e., between the respective variables).
In particular, in various embodiments, the computer 30 uses for the dataset 200′1 (gene expression) and the dataset 200′2 (copy number variations) the so-called biweight-midcorrelation metric, since both datasets contain real values. This metric is per se well known and has already been used for determining the correlations between gene expressions. For instance, for this purpose, we may cite the corresponding Wikipedia webpage "Biweight midcorrelation" (available at the link https://en.wikipedia.org/wiki/Biweight_midcorrelation) or the document by Zheng CH, Yuan L, Sha W, Sun ZL, "Gene differential coexpression analysis based on biweight correlation and maximum clique", BMC Bioinformatics, 2014;15 Suppl. 15:S3, doi:10.1186/1471-2105-15-S15-S3, the contents of which are purposely incorporated herein for reference. In general, a Pearson correlation could also be used; however, as already indicated in the article by Zheng, this correlation tends to be too sensitive to outliers.
Instead, in various embodiments, the computer 30 uses for the dataset 200′3 (mutations) the mutual information, preferably normalized, since this dataset contains only binary values 0 or 1. This correlation measurement is likewise well known per se. For instance, for this purpose, the corresponding Wikipedia webpage "Mutual information" (available at the link https://en.wikipedia.org/wiki/Mutual_information) may be cited.
Consequently, in step 1104 the computer 30 determines for each dataset 200′1, 200′2, and 200′3 data that identify a measurement of similarity/correlation sij between the values of each of the p genes of the dataset (i.e., each node Ni that corresponds, respectively, to the variables p′1, p′2, or p′3) and all the other genes of the same dataset (i.e., the other nodes Nj of the network itself). For instance, subsequently these measurements are identified via matrices S1, S2, and S3, respectively, for the networks 2021, 2022, and 2023.
In various embodiments, once the computer 30 has calculated the measurements of similarity/correlation sij, it calculates (in step 1104) the weights wij, for example expressed (as for the matrix S) in the form of a weighted adjacency matrix W of the network, as a function of the respective measurement of similarity/correlation sij. In particular, in various embodiments, the computer 30 converts all the measurements of similarity/correlation sij into positive values; i.e., the computer 30 determines the absolute value of the measurement of similarity/correlation sij. In various embodiments, the computer 30 then determines the weight wij by scaling the absolute value of the respective measurement sij, preferably by means of an exponential function, for example of the type:
wij = |sij|^β    (1)
This adjacency function (the function that maps the similarity matrix S into the adjacency matrix W) then makes it possible to avoid the disadvantages of a decision via thresholds, i.e., so-called hard thresholding, and, thanks to the use of absolute values, makes it possible to have a contribution from the genes correlated both positively and negatively (e.g., ZHANG, Bin; HORVATH, Steve, "A general framework for weighted gene co-expression network analysis", Statistical Applications in Genetics and Molecular Biology, 2005, 4.1). Moreover, the coefficient β makes it possible to follow the criterion of scale-free topology. As will be described in greater detail hereinafter, the structure of the network is subsequently reduced on the basis of the values of the weights wij. Consequently, the value of β should be chosen in such a way as to obtain a network with:
For instance, in various embodiments the exponent β is chosen:
Consequently, by the end of step 1104, the computer 30 has determined, for each network 2021, 2022, and 2023, the weights wij that connect each node/gene N (i.e., the respective variables p′1, p′2 and p′3) to the other nodes/genes N. For instance, subsequently these weights are identified via the matrices W1, W2, and W3 that have been determined, respectively, for the data of the matrices S1, S2, and S3, i.e., respectively, for the networks 2021, 2022, and 2023.
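Eq. (1) above amounts to the following soft-thresholding step; this is a minimal sketch, and β = 4 is just one value in the range indicated for the exponent:

```python
import numpy as np

# Soft thresholding per Eq. (1): w_ij = |s_ij|**beta, which avoids a hard
# cut-off and lets both positively and negatively correlated genes contribute.
rng = np.random.default_rng(2)
S = rng.uniform(-1, 1, size=(5, 5))
S = (S + S.T) / 2          # similarity matrices are symmetric
beta = 4                   # assumed exponent, within the indicated range

W = np.abs(S) ** beta      # weighted adjacency matrix
```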
In a step 1106, the computer 30 moreover determines, for each gene/node N of a network 202, a measurement of similarity/correlation sij between the node N and all the nodes N of the other networks, for example between the node Ni of the network 2021 and all the nodes Nj of the network 2022, and likewise all the nodes Nj of the network 2023, for example thus generating:
In particular, the inventors have noted that the computer can use, for generation of the matrix S12 (i.e., of the respective measurements of similarity) between the gene expression and the copy number variations, the biweight-midcorrelation metric. Instead, the computer can use for generation of the other matrices S13 and S23 the point-biserial correlation. This correlation measurement is likewise well known per se. For instance, for this purpose, the corresponding Wikipedia webpage "Point-biserial correlation coefficient" (available at the link https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient) may be cited.
Once the computer 30 has determined the measurements of similarity of the nodes between the different networks, for example the matrices S12, S13, and S23, the computer 30 then calculates for each measurement of similarity sij the respective weight wij. For instance, subsequently these weights are identified via matrices W12, W13, and W23 that have been determined, respectively, for the data of the matrices S12, S13, and S23. For instance, in various embodiments, the computer uses for this purpose again Eq. (1). For instance, in various embodiments, the exponent for the transformation between matrices S12, S13, and S23 and the matrices W12, W13, and W23 is chosen between 2 and 10, preferably between 2 and 5, for example, 4.
Consequently, by the end of step 1104, the computer 30 has determined the weights W1, W2, and W3 that connect the nodes N at the network level, i.e., the intra-network weights, and by the end of step 1106 the weights W12, W13, and W23 that connect the nodes N to the nodes N of the other networks, i.e., the inter-network weights. In general, approximately 50% of the data are redundant, since the weight wij (and likewise the measurement of similarity sij) between the node Ni and the node Nj has the same value as the weight wji between the node Nj and the node Ni. Consequently, these data can also be stored in any other form (for example, in the form of a list) that enables determination of the weight wij between two nodes Ni and Nj, i.e., genes, which may belong to one and the same network or to two different networks.
As explained previously, neglecting the redundancy of the data, each of these weight matrices hence has a size of p×p values. However, since the number m of reference patients PR is small, the networks obtained in the previous steps are considerably noisy. Consequently, in various embodiments, the computer 30 generates, in a step 1108, for each matrix W1, W2, W3, W12, W13, and W23 a respective matrix W′, i.e., matrices W′1, W′2, W′3, W′12, W′13, and W′23, using a backboning procedure configured for extracting the latent structure by pruning/removing the non-salient edges, i.e., connections between two nodes.
In particular, the inventors have noted that the computer 30 can use in step 1108 the NC (Noise-Corrected) method. The respective method is described in the article by Coscia M., Neffke F., "Network Backboning with Noisy Data", arXiv:1701.07336 [physics.soc-ph]. This method basically consists of three steps:
For instance, the NC method forms part of the software module "Network-Backboning" of Michele Coscia, published in Python and available, for example, at https://www.michelecoscia.com/?page_id=287, the contents of which are incorporated herein for this purpose. In general, even though experimental tests have highlighted that NC yields the best result, other backboning methods may also be used, such as other methods supported by the aforementioned software module, e.g.:
In particular, in various embodiments, the computer 30 prunes, for all the intra-omics networks (W1, W2, and W3) and inter-omics networks (W12, W13, and W23), the edges using a threshold that corresponds to a p-value of 0.05. Consequently, at the end of step 1108, the computer has pruned all the non-salient edges, maintaining only the latent structure of each network 2021, 2022, and 2023. For instance, for this purpose the computer can generate matrices W′1, W′2, W′3, W′12, W′13, and W′23, where the values of the non-salient edges/weights of the matrices W1, W2, W3, W12, W13, and W23 have been set to 0.
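A much-simplified illustration of the backboning idea is sketched below: edges are kept only if their weight is significant at a one-sided p-value of 0.05 under a naive global normal noise model. This is NOT the actual Noise-Corrected estimator of Coscia and Neffke, which models the noise on each edge individually:

```python
import numpy as np
from scipy.stats import norm

# Synthetic symmetric weight matrix with zero diagonal.
rng = np.random.default_rng(3)
W = np.abs(rng.normal(size=(50, 50)))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

# Global noise model: keep only edges exceeding the one-sided 5% quantile.
mu, sigma = W.mean(), W.std()
threshold = mu + norm.ppf(1 - 0.05) * sigma

# Prune non-salient edges by setting their weights to 0.
W_backbone = np.where(W > threshold, W, 0.0)
```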
In various embodiments, the computer then joins, in a step 1110, all the (e.g., six) interaction networks W′1, W′2, W′3, W′12, W′13, and W′23 obtained in step 1108 in a single multi-layer network, where each layer of the network represents an omics (mRNA gene expression, copy number variations, and mutations), and the inter-omics networks connect the different layers.
Operation of steps 1104-1110 is also illustrated in
Typically, the resulting network 204 comprises nodes with a reduced number of connections. Consequently, in various embodiments, the computer 30 determines, in a step 1112, sub-networks, the so-called communities, of the multilayer network. For instance, in various embodiments, the computer uses for this purpose the Infomap method in its multilayer version. The Infomap software package is in itself well known in the technical field and is described, for example, in D. Edler and M. Rosvall, "The MapEquation software package", https://mapequation.github.io/infomap/ (2014-2019). Since the Infomap algorithm is stochastic and iterative, the computer 30 performs a given number of Infomap executions, for example chosen between 20 and 500, for example between 50 and 200, for example 100 executions, and the final structure of the communities is the best result obtained over these executions. Basically, the computer calculates, for each execution/iteration, a measurement of the mean description length per step of a random walker that moves along the (weighted) connections/edges between the nodes of the network. For each iteration, a new partitioning of the network into communities is recorded if the description length is shorter than the previous shortest description length.
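The repeated-execution strategy can be sketched as follows, with networkx's Louvain method standing in for the multilayer Infomap runs; Infomap would instead minimize the map equation's description length, whereas this sketch maximizes modularity, and all parameters are illustrative:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Toy network with a planted community structure (4 groups of 20 nodes).
G = nx.planted_partition_graph(4, 20, p_in=0.6, p_out=0.02, seed=42)

# Several stochastic executions; keep the partition with the best objective.
best_partition, best_score = None, float("-inf")
for seed in range(20):
    partition = louvain_communities(G, seed=seed)
    score = modularity(G, partition)   # Infomap would use the code length instead
    if score > best_score:
        best_partition, best_score = partition, score
```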
In various embodiments, the computer 30 then generates a list of communities 206 by selecting only the communities determined by Infomap that comprise a plurality of nodes. For instance, with reference to the dataset described previously, the computer 30 can extract 217 communities (each comprising a plurality of nodes) for the p=15736 genes. In particular, thanks to the use of a multilayer network 204, the communities may comprise nodes of different networks 202, i.e., information on the genes from different omics.
In general, even though experimental tests have highlighted that Infomap yields the best result, other methods for determining the communities/sub-networks may also be used, such as: an improved label propagation algorithm (e.g., Li H, Zhang R, Zhao Z, Liu X, "LPA-MNI: An Improved Label Propagation Algorithm Based on Modularity and Node Importance for Community Detection", Entropy (Basel), 2021 Apr. 21, 23(5):497, doi: 10.3390/e23050497); non-negative matrix factorization (e.g., see the Wikipedia webpage "Non-negative matrix factorization", available at https://en.wikipedia.org/wiki/Non-negative_matrix_factorization), etc.
Once the computer 30 has obtained the list of communities 206, it performs, in a step 1114, a feature-extraction operation, in which the computer 30 extracts from each community in the list 206 one or more synthetic variables that summarize the information on the interrelations between the nodes of the community, i.e., between data that may refer to different genes and/or different omics.
Consequently, in various embodiments, the computer 30 uses, in step 1114, a method of dimensionality reduction that receives as input the values that each gene has in each omics belonging to the community and returns at output one (or more) synthetic variables, the so-called features F. For instance, in various embodiments, the computer can generate one or more features F using, for this purpose, the UMAP (Uniform Manifold Approximation and Projection) method and the corresponding software package as a nonlinear dimensionality-reduction technique.
For instance, this is illustrated in
In general, even though experimental tests have highlighted that UMAP yields the best result, other methods of feature extraction/dimensional reduction may also be used.
Consequently, the computer generates for each community a dataset 208 that comprises the respective data of the datasets 200′1, 200′2, and 200′3, and uses the feature-extraction algorithm, e.g., UMAP, to generate one or more features F. In particular, in various embodiments, the computer 30 is configured for dividing the variables of a community into a first subset of variables that comprises the variables of the community with continuous values (in particular, the gene expressions p1 and the copy number variations p2) and a second subset of variables that comprises the variables of the community with discrete values (in particular, the mutations p3). In this case, the computer can then generate for each subset of variables a respective dataset 208, which comprises the respective data of the datasets 200′1, 200′2, and 200′3, and use the feature-extraction algorithm to generate one or more features F for the respective subset of variables, for example one or more features F1 for the first subset of variables and one or more features F2 for the second subset of variables.
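Purely by way of non-limiting illustration, the division of the variables of a community into a continuous subset and a discrete subset may be sketched as follows; the variable names and the suffix convention used to tag each omics layer are hypothetical.

```python
# Hypothetical community: variable names tagged with their omics layer.
community = ["TP53_expr", "TP53_cnv", "KRAS_expr", "KRAS_mut", "EGFR_mut"]

CONTINUOUS = ("_expr", "_cnv")   # gene expressions p'1, copy number variations p'2
DISCRETE = ("_mut",)             # mutations p'3

# First subset: continuous-valued variables; second subset: discrete-valued ones.
subset_continuous = [v for v in community if v.endswith(CONTINUOUS)]
subset_discrete = [v for v in community if v.endswith(DISCRETE)]
print(subset_continuous, subset_discrete)
```

Each subset would then feed a respective dataset 208 and yield its own features, e.g., F1 for the continuous subset and F2 for the discrete subset.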
In the embodiment considered, the computer 30 also stores, in step 1114, the mapping rules RF used for generating each of the features F as a function of the respective variables p′1, p′2, and/or p′3. For instance, assuming that the computer has identified 217 communities, and the variables (p′1, p′2, and/or p′3) that belong to each community are transformed via a respective rule RF into two features F1 and F2, the solution proposed herein obtains a total of 434 features F for the 47208 (15736×3) initial variables.
Consequently, in a step 1116, the computer generates a training dataset 210 by using the mapping rules RF for calculating, for each reference patient PR, the values of the features F as a function of the respective data 200 of the reference patient PR (see
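Purely by way of non-limiting illustration, a stored mapping rule RF may be represented as the centring vector and projection axis learned during feature extraction, which step 1116 then applies to the omics data 200 of each reference patient PR; the numbers and patient identifiers below are synthetic.

```python
import numpy as np

# Hypothetical stored mapping rule RF for one community: the centring
# vector and projection axis learned during feature extraction.
rule = {"mean": np.array([1.0, 2.0, 0.5]),
        "axis": np.array([0.6, 0.8, 0.0])}

def apply_rule(rule, patient_vars):
    """Compute the community's feature F for one patient."""
    return float((patient_vars - rule["mean"]) @ rule["axis"])

# Training dataset 210: one feature value per reference patient PR.
patients = {"PR1": np.array([1.5, 2.5, 0.5]),
            "PR2": np.array([0.5, 1.0, 1.0])}
dataset = {pid: apply_rule(rule, x) for pid, x in patients.items()}
print(dataset)
```

In the complete system each patient row of the dataset 210 would contain the features of all communities, together with the respective value of the variable of interest v.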
Consequently, in a step 1118, the computer can use the training dataset 210 for training a classifier, such as a machine-learning algorithm, which is able to estimate the value of the variable of interest v as a function of the values of a given set of features F. For instance, the classifier may be an artificial neural network, a support-vector machine, or any other parameterized mathematical function configured for estimating the variable of interest v as a function of the values of the features F. In this case, the computer 30 can vary, in step 1118, the values of the parameters PC of the mathematical function in such a way as to reduce a cost function calculated on the basis of the differences between the estimates v′ of the variable v supplied by the mathematical function and the value of the variable v in the dataset 210. Finally, the training procedure 1100 terminates in an end step 1120. In this context, the Italian patent application No. 102022000001817, the contents of which are incorporated herein by reference, describes a machine-learning method that is able to estimate the disease-free survival time of a patient. In particular, this document proposes the use of PCA (Principal-Component Analysis) for extraction of features from the omics data. Consequently, the above PCA could be replaced with the solution described herein.
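Purely by way of non-limiting illustration, the parameter update of step 1118 may be sketched as a gradient descent on a linear parameterized function that reduces the squared-error cost between the estimates v′ and the values v of the dataset 210; the data, sizes, and learning rate below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative training dataset 210: features F per patient, target v.
F = rng.normal(size=(50, 4))
true_w = np.array([1.0, -2.0, 0.5, 0.0])
v = F @ true_w + 0.01 * rng.normal(size=50)

# Parameterized function v' = F @ PC; iteratively reduce the cost.
PC = np.zeros(4)
lr = 0.05
for _ in range(500):
    v_est = F @ PC                            # estimates v'
    grad = 2.0 / len(v) * F.T @ (v_est - v)   # d(cost)/d(PC)
    PC -= lr * grad                           # update parameters PC

cost = float(np.mean((F @ PC - v) ** 2))
print(round(cost, 6))
```

An artificial neural network or support-vector machine would replace the linear function `F @ PC`, but the principle of iteratively varying the parameters PC to reduce the cost is the same.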
Consequently, the solutions described herein enable a considerable reduction of the number of variables/features that must be taken into consideration by the classifier (steps 1118 and 1204). Moreover, irrespective of the specific variable of interest v, the method is able to identify communities of variables that are connected together. Consequently, by analysing, via one or more further feature-extraction algorithms, the link between the variable of interest v and each feature F, the computer 30 can determine the correlation between the variable of interest v and each community (and hence the respective variables p′1, p′2, and/or p′3), thereby also enabling analysis of possible biological aspects that may explain the link between given genes and given diseases.
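Purely by way of non-limiting illustration, the link between the variable of interest v and each community feature F may be quantified with a simple correlation ranking; the synthetic data below make the first community the one driving v.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative features F (one column per community) and variable of interest v.
F = rng.normal(size=(100, 3))
v = 2.0 * F[:, 0] + 0.1 * rng.normal(size=100)  # v driven by community 0

# Pearson correlation of v with each community's feature.
corr = [float(np.corrcoef(F[:, j], v)[0, 1]) for j in range(F.shape[1])]
ranked = sorted(range(F.shape[1]), key=lambda j: -abs(corr[j]))
print(ranked[0])  # index of the community most linked to v
```

Having identified the most correlated community, the respective variables p′1, p′2, and/or p′3 (and hence the underlying genes) can then be inspected for biological interpretation.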
Of course, without prejudice to the underlying principles of the invention, the details of implementation and the embodiments may vary widely with respect to what has been described and illustrated herein purely by way of example, without thereby departing from the scope of the present invention as this is defined in the annexed claims.
Number | Date | Country | Kind
---|---|---|---
102022000005861 | Mar 2022 | IT | national