The present invention relates to a method and system for predicting age based on the analysis of various omics data. More specifically, it relates to a method for predicting aging or biological age using integrated information such as DNA methylation, mRNA expression level, and telomere length, in which various omics data related to telomere length, DNA methylation, mRNA expression level, etc. are acquired from a specimen (e.g., blood) of a subject and comprehensively analyzed to predict the biological age of the subject and to classify and analyze the degree of aging for each type of omics data, and to a system for performing the same.
Biological age refers to the age quantified by comprehensively evaluating the overall health status and the degree of aging. In predicting such aging/biological age, the method using telomere length has been generally used. Biomarkers and combinations thereof are being developed to predict age based on DNA methylation or gene expression levels that change significantly with age.
The issue to be addressed by the present invention is to provide a method and system for predicting biological age based on various omics data analysis that can solve the problems of the prior art.
A system for predicting biological age based on various omics data analysis according to an embodiment of the present invention for addressing the above issues comprises: a test sample collection unit for collecting a plurality of genetic test samples, including at least one of DNA and RNA of a subject; a test sample analysis unit for analyzing a plurality of types of omics data from each of the plurality of genetic test samples; a preprocessing unit for preprocessing the omics data analyzed through the test sample analysis unit; an association analysis unit for performing an association analysis based on the omics type of data for each omics area converted through the preprocessing unit; and an age prediction unit for predicting the subject's age based on the analyzed result of the association analysis unit and the data for each omics area.
A method for predicting biological age based on various omics data analysis according to an embodiment of the present invention for addressing the above issues comprises the steps of: collecting a plurality of genetic test samples, including at least one of DNA and RNA of a subject, in a test sample collection unit; analyzing a plurality of types of omics data from each of the plurality of genetic test samples in a test sample analysis unit; preprocessing the omics data analyzed through the test sample analysis unit in a preprocessing unit; performing an association analysis based on each omics type of data for each omics area converted through the preprocessing unit in an association analysis unit; and predicting the age of the subject based on the analysis result of the association analysis unit and the data for each omics area in an age prediction unit.
The method and system for predicting biological age based on various omics data analysis according to an embodiment of the present invention are used to combine and reflect markers of various omics areas in the biological age prediction model, thereby having the advantage of being able to offset the errors present in individual omics areas. This allows more accurate age prediction and makes it possible to distinguish and interpret the influence (or the degree of aging) of each omics area on the integrated predicted biological age (the current degree of aging of the subject).
That is, through the combination of three types of omics data, such as the genome (telomere length), epigenome (methylation), and transcriptome (gene expression), of samples such as human blood: 1) the age prediction accuracy can be increased by offsetting the noise; and 2) the biological age (degree of aging) of the subject can be analyzed separately for each omics area.
Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can easily carry out the present invention. However, the present invention may be embodied in several different forms and is not limited to the embodiments described herein. Further, in order to clearly explain the present invention in the drawings, parts irrelevant to the description are excluded, and similar reference numerals are assigned to similar parts throughout the specification.
Throughout the specification, when a part is “connected” with another part, this includes not only the case where it is “directly connected” but also the case where it is “electrically connected” with another element interposed therebetween. Further, when a part “includes” a certain component, it means that other components may be further included, rather than excluding other components, unless otherwise stated, and it is to be understood that the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof is not precluded in advance.
The terms “about,” “substantially,” etc. related to the extent used throughout the specification are used in a sense at or close to the numerical value when the manufacturing and material tolerances inherent in the stated meaning are presented and are used to prevent an unscrupulous infringer from using the disclosure in which exact or absolute values are mentioned to aid the understanding of the present invention. As used throughout the specification of the present invention, the term “step of (to)” or “step of” does not mean “step for.”
In this specification, a “unit” includes a unit implemented by hardware, a unit implemented by software, and a unit realized using both. In addition, one unit may be implemented using two or more pieces of hardware, and two or more units may be implemented with one piece of hardware.
In this specification, some of the operations or functions described as being performed by the terminal, apparatus, or device may instead be performed in a server connected to the terminal, apparatus, or device. Similarly, some of the operations or functions described as being performed by the server may also be performed in a terminal, apparatus, or device connected to the server.
In this specification, some of the operations or functions described as mapping or matching with the terminal may be interpreted to mean mapping or matching the terminal's unique number or personal identification information, which is identifying data of the terminal.
Hereinafter, the present invention is described in detail with reference to the accompanying drawings.
First, as shown in
More specifically, the system 100 for predicting biological age based on the analysis of various omics data of the present invention includes a test sample collection unit 110, a test sample analysis unit 120, a preprocessing unit 130, an association analysis unit 140, a weight allocation unit 150, a weight correction unit 160, and an age prediction unit 170.
The test sample collection unit 110 includes a configuration for collecting a plurality of genetic test samples containing the DNA and RNA of the subject, that is, a configuration for collecting and then classifying a plurality of aging biomarker test samples.
Here, the aging biomarker may be information obtained by collecting DNA from blood samples of various age groups and measuring telomere length, performing DMR (differentially methylated region) analysis through methyl-seq or chip experiments, performing DEG (differentially expressed gene) analysis through RNA-seq or microarray experiments on collected RNA, and the like.
Further, each aging biomarker test sample is classified into learning data and test data, and the classified learning data and test data are used for age prediction of each omics area marker, integrated omics analysis, and predicted age weight summation analysis.
Next, the test sample analysis unit 120 is configured to analyze a plurality of types of omics data from each of the plurality of genetic test samples. That is, it may be a configuration for analyzing an omics area, including telomere length information, DNA methylation information, and gene expression level for each gene from the plurality of genetic test samples.
Specifically, the test sample analysis unit 120 comprises a telomere length measurement unit, a methylation marker analysis and filtering unit, and a gene expression marker analysis and filtering unit.
The telomere length measurement unit is configured to measure the relative length of telomeres compared to a single-copy gene using a method such as qPCR, TRF, or Q-FISH. In the qPCR method, the threshold cycle number (Ct) at which fluorescence is detected is measured for each concentration of a standard oligomer sample of known length. Then the total telomere length is obtained by dividing the Ct value of the telomeres by the Ct value of the reference gene, and the absolute length of the telomeres is obtained by dividing this by the number of telomeres in the human genome.
For reference, the above-described method for measuring telomere length is only an example, and various conventional methods for measuring telomere length may be applied.
Next, the methylation marker analysis and filtering unit maps the methylation raw data obtained through experiments such as Methyl-seq or chip to a human genome map (human reference), thereby obtaining the degree of methylation (hereinafter, “beta value”) by location for each test sample, and selects areas in which beta values increase or decrease according to age in each test sample using DMR analysis.
Next, the gene expression marker analysis and filtering unit maps the gene expression raw data obtained through experiments such as RNA-seq and microarray to the human genome map (human reference) to calculate the expression level for each gene in each test sample, removes the batch effect according to gender, lifestyle, etc. from the calculated gene expression level, and then selects genes whose expression level increases or decreases according to age in each test sample using DEG analysis and the like.
Next, the preprocessing unit 130 is configured to perform preprocessing on the omics data analyzed through the test sample analysis unit 120.
More specifically, the preprocessing unit 130 converts beta values and expression level values of selected methylation markers and gene expression markers, and telomere length into percentiles in the range of 0 to 1 for the application of multiple linear regression analysis or artificial neural network-based regression analysis.
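The percentile conversion described above can be sketched as follows. This is a minimal illustration, not the implementation of the preprocessing unit 130: the specification does not define the percentile formula, so the sketch assumes a rank-based definition, (average rank − 1) / (n − 1), with tied values sharing their average rank. The function name `to_percentiles` and the sample beta values are hypothetical.

```python
def to_percentiles(values):
    """Map each value to a rank-based percentile in the range 0 to 1.

    Assumption (not stated in the specification):
    percentile = (average rank - 1) / (n - 1), where tied values
    share the average of their ranks.
    """
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.5]  # arbitrary midpoint for a single value
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [(r - 1) / (n - 1) for r in ranks]


# Hypothetical beta values of one methylation marker across five samples.
beta_values = [0.12, 0.40, 0.33, 0.40, 0.05]
percentiles = to_percentiles(beta_values)
print(percentiles)
```

A rank-based conversion puts beta values, expression levels, and telomere length on the same 0 to 1 scale regardless of their original units, which is the stated purpose of this preprocessing step.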
Next, the association analysis unit 140 performs an association analysis based on each omics type of data for each omics area converted through the preprocessing unit 130. More specifically, the association analysis unit 140 uses multiple linear regression analysis or artificial neural network-based regression analysis to calculate the coefficient of each independent variable in a regression model that takes the preprocessed biomarker values of each omics area as independent variables and biological age as the dependent variable. Using the calculated coefficients, the association between the biological age predicted for each area from the preprocessed biomarker values of that omics area and the actual age is analyzed, and the analyzed association may be any one of the coefficient of determination (Rx2), significance (PVALx), and mean absolute error (MAEx).
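Two of the association measures named above can be computed from the predicted and actual ages as in the following sketch. It assumes the coefficient of determination is defined as 1 − SS_res/SS_tot, which the specification does not state explicitly, and it omits the significance (PVALx), which requires a statistical test not reproduced here. The function names and the age values are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences between actual and predicted ages."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual ages and ages predicted from one omics area.
actual_age = [30.0, 40.0, 50.0, 60.0]
predicted_age = [32.0, 38.0, 53.0, 59.0]

r2_x = r_squared(actual_age, predicted_age)
mae_x = mean_absolute_error(actual_age, predicted_age)
print(r2_x, mae_x)
```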
Next, the weight allocation unit 150 may be configured to assign a weight to each type of omics data based on any one of the associations (coefficient of determination, significance, and mean absolute error) analyzed through the association analysis unit.
The weight allocation unit 150 calculates the weight (Wx) for the coefficient of determination (Rx2), significance (PVALx), and mean absolute error (MAEx) of each omics area using the following equations.
$W_x = 1/(1 - R_x^2)$ (Weight equation for coefficient of determination)
$W_x = -\log(\mathrm{PVAL}_x)$ (Weight equation for significance)
$W_x = 1/\mathrm{mae}_{x,rev}$ (Weight equation for mean absolute error)
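The three weight equations above can be sketched as small functions. This is an illustration under stated assumptions rather than the implementation of the weight allocation unit 150: the base of the logarithm is not given in the specification and base 10 is assumed here, and the input range checks and function names are hypothetical additions.

```python
import math


def weight_from_r2(r2):
    """W_x = 1 / (1 - R_x^2); a larger R_x^2 gives a larger weight."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1)")  # range check is an assumption
    return 1.0 / (1.0 - r2)


def weight_from_pvalue(p_value):
    """W_x = -log(PVAL_x); base 10 is assumed, the source does not state it."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    return -math.log10(p_value)


def weight_from_mae(mae_rev):
    """W_x = 1 / mae_{x,rev}; a smaller corrected error gives a larger weight."""
    if mae_rev <= 0.0:
        raise ValueError("mae_rev must be positive")
    return 1.0 / mae_rev


# Hypothetical association values for a single omics area.
w_r2 = weight_from_r2(0.9)
w_p = weight_from_pvalue(1e-30)
w_mae = weight_from_mae(0.1)
print(w_r2, w_p, w_mae)
```

In all three cases the transformation turns a "better" association (higher R², smaller p-value, smaller error) into a larger weight.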
The weight correction unit 160 may be configured to obtain a correction weight (Wx,rev) by exponentiating a weighted average value (Wavg) for each area of the weights given to each type of omics data using the following equations.
$W_{x,rev} = W_{avg}^{(W_x/W_{avg})}$ (Weight correction equation)
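The weight correction can be sketched as follows, reading the correction equation as the average weight raised to the power of the ratio Wx/Wavg. Two points are assumptions: that Wavg is the plain arithmetic mean of the weights across omics areas, and the hypothetical weight values used below. The function name `corrected_weights` is also hypothetical.

```python
def corrected_weights(weights):
    """Return W_{x,rev} = W_avg ** (W_x / W_avg) for each omics area.

    Assumption: W_avg is the arithmetic mean of the per-area weights.
    """
    w_avg = sum(weights.values()) / len(weights)
    return {area: w_avg ** (w / w_avg) for area, w in weights.items()}


# Hypothetical per-area weights W_x.
weights = {"telomere": 2.0, "methylation": 8.0, "expression": 2.0}
corrected = corrected_weights(weights)
print(corrected)
```

With these numbers the average weight is 4, so the ratio between the methylation weight and the others grows from 4 to 8 after correction. The correction emphasizes the larger weight in this way only when the average weight exceeds 1.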
Meanwhile, the weight correction unit 160 may obtain distribution correction (maex,rev) by the average age (AGEavg) of the sample group to relatively reflect the mean absolute error compared to the actual age distribution before weight correction for the mean absolute error (MAEx) of each omics area through the following equation.
$\mathrm{mae}_{x,rev} = \mathrm{MAE}_x/\mathrm{AGE}_{avg}$
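As a worked illustration of the distribution correction and the resulting weight, assume a hypothetical omics area with a mean absolute error of 5 years in a sample group whose average age is 50 years. These numbers are illustrative only and are not taken from the tables of this specification.

```latex
\mathrm{mae}_{x,rev} = \frac{\mathrm{MAE}_x}{\mathrm{AGE}_{avg}} = \frac{5}{50} = 0.1,
\qquad
W_x = \frac{1}{\mathrm{mae}_{x,rev}} = \frac{1}{0.1} = 10
```

Dividing by the average age expresses the error as a fraction of the typical age in the sample group, so the same absolute error yields a larger weight in an older sample group than in a younger one.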
Next, the age prediction unit 170 is configured to predict the subject's age based on the analysis result of the association analysis unit and the data for each omics area and may predict the subject's age through the following equation.
That is, the age prediction unit 170 is configured to calculate the weights of each omics area using any one of the coefficients of determination, significance, and mean absolute error for the age of individual omics data and then predict biological age or aging state by comparing the sum of the age inferred from the individual omics according to the weight.
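The weighted summation described above can be sketched as follows. The equation itself is not reproduced in this text, so the sketch assumes a normalized weighted sum, in which each per-area predicted age is multiplied by its weight and the total is divided by the sum of the weights. That normalization, the function name `integrated_age`, and all numeric values are assumptions made for illustration.

```python
def integrated_age(ages, weights):
    """Weighted mean of the per-area predicted ages.

    Assumption: the weighted sum is normalized by the sum of the weights;
    the specification's own equation is not reproduced in this text.
    """
    total_weight = sum(weights[area] for area in ages)
    return sum(ages[area] * weights[area] for area in ages) / total_weight


# Hypothetical predicted ages (AGE_x) and correction weights (W_x,rev).
ages = {"telomere": 48.0, "methylation": 52.0, "expression": 44.0}
weights = {"telomere": 2.0, "methylation": 16.0, "expression": 2.0}

age_integ = integrated_age(ages, weights)
print(age_integ)
```

Because the methylation area carries most of the weight in this example, the integrated age lies close to the methylation-based prediction while still being pulled slightly toward the other two areas.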
Hereinafter, with reference to the drawings, the first comparative example, which compares the predictive power of the linear regression-based biological age with respect to the actual age using telomere length, sixteen methylation markers, and eighteen gene expression markers through the configurations disclosed herein, is briefly described.
1) Omics Integrated Multiple Linear Regression Analysis
In the first comparative example, the association analysis unit 140 of the present application performs multiple linear regression analysis and omics integration analysis for each area using sixteen methylation markers based on preselected adjusted p-value<1.0e-30 and eighteen gene expression markers based on adj.pval<5.0e-02 along with the telomere length.
2) Summation Analysis of Biological Age Weights by Omics Area (Weighted Coefficient of Determination)
The association analysis unit 140 of the present application obtains the coefficient of determination (Rx2) for the actual age of the sample of the biological age predicted for each area from multiple linear regression analysis using the markers of each omics area.
The weight allocation unit 150 of the present application calculates the weight (Wx) of each omics area as in Equation 1 in order to give greater weight to the omics region having a large coefficient of determination.
$W_x = 1/(1 - R_x^2)$ [Equation 1]
Further, when the difference in the coefficient of determination for the actual age of the biological age between each omics area is large, the weight correction unit 160 of the present application calculates a corrected weight value (Wx,rev) through exponentiation of the average weight value (Wavg) of each omics area as shown in Equation 2 in order to emphasize and reflect the age of the omics area with high reliability in the weight (Wx) for each area.
The age prediction unit 170 of the present application calculates an omics-integrated biological age (AGEinteg) by applying the weight for each omics area to the predicted age (AGEx) for each area and summing the results, as shown in Equation 3.
Table 1 below shows the coefficient of determination, weight, and correction weight of each omics area and Table 2 compares the predicted value of omics-integrated biological age by summing the age for each omics area and weights for each omics area and the actual age.
Table 3 compares the biological age prediction results of the omics-integrated regression analysis and of the weighted summation of ages for each omics area with those of individual omics biological age prediction.
Referring to Table 3, it can be seen that the omics-integrated biological age predicted by omics-integrated regression analysis or by weighted summation of ages for each omics area is closer to the actual age of the sample in terms of coefficient of determination and significance, and has a smaller mean absolute error (MAE), than the age predicted through multiple linear regression analysis from individual omics.
Hereinafter, with reference to the drawings, the second comparative example comparing the predictive power of the linear regression-based biological age to the actual age using the telomere length, four methylation markers, and four gene expression markers through the configurations disclosed herein is briefly described.
1) Omics Integrated Multiple Linear Regression Analysis
In the second comparative example, the association analysis unit 140 of the present application performs multiple linear regression analysis and omics integration analysis for each area by selecting four methylation markers based on adj.Pval<1.0e-30 and the absolute value of the association between the marker and the actual age |R|>0.75 and four gene expression markers based on adj.Pval<1.0e-04 along with the telomere length.
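The marker selection criteria used in this example, an adjusted p-value cutoff combined with |R| > 0.75, can be sketched as follows. The sketch assumes R is the Pearson correlation between a marker's values and the actual age, and it takes the adjusted p-values as precomputed inputs rather than deriving them. The marker names, values, and p-values are hypothetical and chosen only to exercise both filters.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def select_markers(marker_values, adj_pvalues, ages, p_cutoff, r_cutoff):
    """Keep markers with adjusted p-value < p_cutoff and |R| > r_cutoff."""
    selected = []
    for name, values in marker_values.items():
        if adj_pvalues[name] < p_cutoff and abs(pearson_r(values, ages)) > r_cutoff:
            selected.append(name)
    return selected


# Hypothetical ages, marker values, and precomputed adjusted p-values.
ages = [25.0, 35.0, 45.0, 55.0, 65.0]
marker_values = {
    "cg_A": [0.10, 0.20, 0.30, 0.40, 0.50],  # rises steadily with age
    "cg_B": [0.50, 0.48, 0.52, 0.49, 0.51],  # roughly flat with age
    "cg_C": [0.90, 0.70, 0.55, 0.35, 0.20],  # falls with age
}
adj_pvalues = {"cg_A": 1e-40, "cg_B": 1e-35, "cg_C": 1e-10}

selected = select_markers(marker_values, adj_pvalues, ages, 1.0e-30, 0.75)
print(selected)
```

Here `cg_B` passes the p-value cutoff but fails the correlation cutoff, and `cg_C` is strongly correlated with age but fails the p-value cutoff, so only `cg_A` is retained.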
2) Summation Analysis of Biological Age Weights by Omics Area (Weighted Coefficient of Determination)
It is applied in the same manner as in the first comparative example. Table 4 below shows the coefficient of determination, weight, and correction weight of each omics area and Table 5 compares the predicted value of omics-integrated biological age by summing the age for each omics area and weights for each omics area and the actual age.
Table 6 compares the biological age prediction results of the omics-integrated regression analysis and of the weighted summation of ages for each omics area with those of individual omics biological age prediction.
Referring to Table 6, it can be seen that the omics-integrated biological age predicted by omics-integrated regression analysis or by weighted summation of ages for each omics area is closer to the actual age of the sample in terms of coefficient of determination and significance, and has a smaller mean absolute error (MAE), than the age predicted through multiple linear regression analysis from individual omics.
Hereinafter, with reference to the drawings, the third comparative example comparing the predictive power of the artificial neural network-based biological age to the actual age using the telomere length, sixteen methylation markers, and eighteen gene expression markers through the configurations disclosed herein is briefly described.
1) Omics Integrated Artificial Neural Network-Based Regression Analysis
In the third comparative example, the association analysis unit 140 of the present application performs artificial neural network-based regression analysis and omics integration analysis for each area by selecting sixteen methylation markers based on adj.Pval<1.0e-30 and eighteen gene expression markers based on adj.Pval<5.0e-02 along with the telomere length.
2) Summation Analysis of Biological Age Weights by Omics Area (Weighted Coefficient of Determination)
It is applied in the same manner as in the first comparative example.
Table 7 below shows the coefficient of determination, weight, and correction weight of each omics area and Table 8 compares the predicted value of omics-integrated biological age by summing the age for each omics area and weights for each omics area and the actual age.
Table 9 compares the biological age prediction results of the omics-integrated regression analysis and of the weighted summation of ages for each omics area with those of artificial neural network-based individual omics biological age prediction.
Referring to Table 9, it can be seen that the omics-integrated biological age predicted by omics-integrated regression analysis or by weighted summation of ages for each omics area is closer to the actual age of the sample in terms of coefficient of determination and significance, and has a smaller mean absolute error (MAE), than the age predicted through artificial neural network-based regression analysis from individual omics.
Hereinafter, with reference to the drawings, the fourth comparative example, which compares linear regression-based age prediction (weight scoring) using the telomere length, sixteen methylation markers, and eighteen gene expression markers through the configurations disclosed herein, is described.
1) Omics Integrated Multiple Linear Regression Analysis
In the fourth comparative example, the association analysis unit 140 of the present application performs multiple linear regression analysis and omics integration analysis for each area by selecting sixteen methylation markers based on adjusted p-value<1.0e-30 and eighteen gene expression markers based on adj.pval<5.0e-02 along with the telomere length.
2-1) Summation Analysis of Biological Age Weights by Omics Area (Weighted Significance)
The association analysis unit 140 of the present application obtains the significance (PVALx) between the biological age predicted for each area (x) from multiple linear regression analysis using the markers of each omics area and the sample's actual age.
The weight allocation unit 150 of the present application calculates the weight (Wx) as in Equation 4 to transform the significance scale distributed in a very small error range.
$W_x = -\log(\mathrm{PVAL}_x)$ [Equation 4]
Further, when the difference in the significance between the biological age and the actual age between each omics area is large, the weight correction unit 160 of the present application calculates a corrected weight value (Wx,rev) as shown in Equation 5 through exponentiation of the average weight value (Wavg) of each omics area in order to emphasize and reflect the age of the omics area with high reliability in the weight (Wx) for each area.
The age prediction unit 170 of the present application calculates the omics-integrated biological age (AGEinteg) by applying the weight for each omics area to the predicted age (AGEx) for each area and summing the results.
Table 10 below shows the significance, weight, and correction weight of each omics area, and Table 11 compares the predicted value of omics-integrated biological age by summing the age for each omics area and weights for each omics area and the actual age.
2-2) Summation Analysis of Biological Age Weights by Omics Area (Weighted Mean Error)
The association analysis unit 140 of the present application obtains mean absolute error (MAEx) between the biological age predicted for each area (x) from multiple linear regression analysis using the markers of each omics area and the sample's actual age.
The weight allocation unit 150 of the present application calculates the weight (Wx) of each omics area as in Equation 7 in order to give greater weight to the omics area with a small mean absolute error.
$W_x = 1/\mathrm{mae}_{x,rev}$ [Equation 7]
Further, in order to reflect the mean absolute error relative to the actual age distribution, a distribution correction (maex,rev) by the average age (AGEavg) of the sample group is applied as shown in Equation 8 below. When the difference in the mean absolute error between the biological age and the actual age is large between omics areas, the weight correction unit 160 of the present application calculates a corrected weight value (Wx,rev) as shown in Equation 9 through exponentiation of the average weight value (Wavg) of each omics area in order to emphasize and reflect the age of the omics area with high reliability in the weight (Wx) for each area. Then, the integrated biological age (AGEinteg) is calculated through summation using the correction weights for each omics area according to Equation 10.
Table 12 below shows the mean absolute error, corrected mean absolute error, weight, and correction weight of each omics area, and Table 13 compares the predicted value of omics-integrated biological age, obtained by summing the age for each omics area with the weights for each omics area, and the actual age.
Table 14 below compares the omics-integrated biological age prediction results of the weighted summation of ages under each weighting method with those of individual omics biological age prediction.
Referring to Table 14, it can be seen that the omics-integrated biological age weighted by scoring significance or mean absolute error is closer to the actual age of the sample in terms of coefficient of determination and significance, and has a smaller mean absolute error, than the age predicted through regression analysis from individual omics.
Hereinafter, a method for predicting biological age based on various omics data analysis according to the first embodiment of the present invention is described with reference to
The method S700 for predicting biological age based on various omics data analysis according to an embodiment of the present invention collects a plurality of genetic test samples, including DNA and RNA of a subject, in the test sample collection unit 110 (S710), then analyzes a plurality of types of omics data (including at least one of telomere length, methylation, and gene expression) from each of the plurality of genetic test samples in the test sample analysis unit 120 (S720), and then performs preprocessing in the preprocessing unit 130 that converts each marker value of the omics data analyzed through the test sample analysis unit 120 into a percentile value in the range of 0 to 1 (S730).
Thereafter, the method performs an association analysis based on the type of omics data for each omics area converted through the preprocessing unit 130 in the association analysis unit 140 (S740).
Process S740 is a process of analyzing the correlation between data for a plurality of omics areas using multiple linear regression analysis or artificial neural network-based regression analysis, in which the analyzed correlation may be any one of the coefficient of determination (Rx2), significance (PVALx), and mean absolute error (MAEx).
Thereafter, the method predicts the subject's age based on the analysis result of the association analysis unit 140 and the data for each omics region in the age prediction unit 170 (S750).
Process S750 may be a process of predicting the subject's age by integrating (summing) the analysis result data for each of the plurality of types of analyzed omics areas.
Hereinafter, a method for predicting biological age based on various omics data analysis according to the second embodiment of the present invention is described with reference to
The method S800 for predicting biological age based on various omics data analysis according to an embodiment of the present invention collects a plurality of genetic test samples, including DNA and RNA of a subject, in the test sample collection unit 110 (S810), then analyzes a plurality of types of omics data (including at least one of telomere length, methylation, and gene expression) from each of the plurality of genetic test samples in the test sample analysis unit 120 (S820), and then performs preprocessing in the preprocessing unit 130 that converts each marker value of the omics data analyzed through the test sample analysis unit 120 into a percentile value in the range of 0 to 1 (S830).
Thereafter, the method performs an association analysis based on the type of omics data for each omics area converted through the preprocessing unit 130 in the association analysis unit 140 (S840).
Process S840 is a process of analyzing the correlation between data for a plurality of omics areas using multiple linear regression analysis or artificial neural network-based regression analysis, in which the analyzed correlation may be any one of the coefficient of determination (Rx2), significance (PVALx), and mean absolute error (MAEx).
When process S840 is completed, the method assigns a weight to each type of omics data based on any one of the associations (coefficient of determination, significance, and mean absolute error) analyzed through the association analysis unit in the weight allocation unit 150 (S850).
Here, the weight allocation unit 150 calculates the weight (Wx) for the coefficient of determination (Rx2), significance (PVALx), and mean absolute error (MAEx) of each omics area using the following equations.
$W_x = 1/(1 - R_x^2)$ (Weight equation for coefficient of determination)
$W_x = -\log(\mathrm{PVAL}_x)$ (Weight equation for significance)
$W_x = 1/\mathrm{mae}_{x,rev}$ (Weight equation for mean absolute error)
When process S850 is completed, the weight correction unit 160 may be configured to obtain a correction weight (Wx,rev) by exponentiating a weighted average value (Wavg) for each area of the weights given to each type of omics data using the following equation (S860).
$W_{x,rev} = W_{avg}^{(W_x/W_{avg})}$
Meanwhile, the weight correction unit 160 may obtain distribution correction (maex,rev) by the average age (AGEavg) of the sample group to relatively reflect the mean absolute error compared to the actual age distribution before weight correction for the mean absolute error (MAEx) of each omics area through the following equation.
$\mathrm{mae}_{x,rev} = \mathrm{MAE}_x/\mathrm{AGE}_{avg}$
When process S860 is completed, the age prediction unit 170 predicts the subject's age based on the analysis result of the association analysis unit and the data for each omics area, and the subject's age is predicted through the following equation (S870).
Therefore, an embodiment of the present invention combines and reflects markers of several omics areas in the biological age prediction model, thereby offsetting errors existing in individual omics areas, allowing more accurate biological age prediction, and making it possible to interpret the predicted biological age (the current aging state of the subject) separately in terms of the influence (or aging state) of each omics area.
For example, through a combination of three types of omics data, including the genome (telomere length), epigenome (methylation), and transcriptome (gene expression), of samples such as human blood, 1) the age prediction accuracy can be improved by canceling the noise inherent in each type of omics data, and 2) the biological age (degree of aging) of the subject can be analyzed separately for each omics area.
The above description of the present invention is for illustration, and those of ordinary skill in the art to which the present invention pertains can understand that it can be easily modified into other specific forms without changing the technical spirit or essential features of the present invention. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not restrictive. For example, each component described as a single type may be implemented in a dispersed form, and likewise components described as distributed may be implemented in a combined form.
The scope of the present invention is indicated by the following claims rather than the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included in the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
10-2020-0045382 | Apr 2020 | KR | national |
This application is a continuation of International Patent Application No. PCT/KR2021/004293, filed on Apr. 6, 2021, which claims priority to Korean Patent Application No. 10-2020-0045382 filed in the Korean Intellectual Property Office on Apr. 14, 2020, the disclosures of which are incorporated by reference herein in their entireties.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/KR2021/004293 | Apr 2021 | US |
Child | 17965945 | | US |