The present invention relates to a data analysis method and system for disease diagnosis aid, and more particularly, to a technique and system capable of providing analysis results through integrated analysis of clinical data, MRI images, and genomic data in order to aid disease diagnosis.
Existing systems to aid in the diagnosis of a disease require phenotypic data and genotypic data, and analyze the data and provide a service that recommends the candidate disease name of the patient. Examples of systems that provide the service include Phenomizer, GenIO, and PhenoVar.
Phenomizer shows a list of candidate diseases that correlate highly with the patient's phenotypic data by calculating the similarity between the patient's phenotypic data and the phenotypic data provided by published disease databases. However, because Phenomizer predicts the candidate disease list from the patient's phenotypic data alone, additional tools or systems are required in order to use the patient's actual genetic data together with it.
GenIO is a system developed to assist in the diagnosis process for rare genetic diseases, and provides services to find disease-causing variants of patients after analyzing clinical data and genotypic data. To provide the service, GenIO uses a program called Phenolyzer to obtain a list of candidate genes associated with the inputted phenotypic data, and then finds the variant that causes the patient's disease by filtering the patient's genotypic data based on that list and classifying the variants according to mode of inheritance, pathogenicity, etc. However, in this system the size of the genotypic data that can be analyzed is limited to 200 MB, and both clinical and genotypic data are essential for data analysis. In addition, since the analysis result is a list of variants that cause a patient's disease, additional effort is required to find information on each variant before it can be used in actual diagnosis.
PhenoVar is also a system designed to achieve the goal of helping healthcare professionals to diagnose patients and the corresponding system provides a service that predicts a candidate disease of a real patient using clinical and genotypic data. PhenoVar uses an algorithm to quantify the association with specific diseases for each clinical and genotypic data to calculate the weight representing the association with a specific disease according to each data type, integrates the calculated weights, and provides a candidate disease list based on the final diagnostic score calculated for each disease. However, PhenoVar has several drawbacks. It is designed to input only the information belonging to several sub-categories provided by PhenoVar when inputting the patient's phenotypic data so that the phenotypic data available is limited. In addition, the local database used for clinical data analysis has a limitation that most of them are simulated patient's phenotypic data based on published disease related databases rather than actual patient data. In addition, like the GenIO system, the system has the disadvantage of requiring clinical and genotypic data.
As described above, most of the existing systems are developed based on analysis methods using clinical and genotypic data, and there are limitations in available input data formats or sizes. Furthermore, most existing systems require the input of specific data formats for analysis. Due to these problems, it is inconvenient for clinicians to use the system as an aid tool for patient diagnosis in a real clinical environment. For example, when presenting indirect evidence as a result, rather than direct evidence of a patient's candidate disease, or when considering additional types of data that are not supported by the existing system to use it for diagnosis, additional efforts and tools are needed to process the data. In addition, when there is no data required by the system, there is also a problem that the service provided by the system cannot be used.
Therefore, providing services that aid precise diagnosis of patients requires a system that places no particular limitation on the input data format and that includes an integrated analysis method adapted to the various input data.
Accordingly, the present invention is to solve the above problems, and aims to develop and construct a system including an analysis method capable of integrating genomic, clinical, and MRI data for disease diagnosis aid.
A method for analyzing data for disease diagnosis aid according to an embodiment of the present invention for solving the above problems may include receiving, by a processor of a computer, medical data of a subject; selecting, by the processor, disease-related data using the medical data; and calculating, by the processor, a disease probability according to the selected disease-related data, wherein the medical data may include at least two or more of clinical records, genes and genetic variants, or MRI.
According to an embodiment of the present invention, the selecting of the disease-related data may include selecting a genome variant having a possibility of disease association among all genes and gene variants of the subject.
According to an embodiment of the present invention, the calculating of the disease probability may include: calculating a probability that the gene and gene variants selected by the processor are disease-related information; calculating an average rank of the selected genes according to the probability; and calculating a disease gene probability according to the number of disease candidate genes of the subject.
According to an embodiment of the present invention, the selecting of the disease-related data may include selecting a volume value of the MRI, a white matter damage volume value, a cortical and subcortical region T2 high signal damage volume value, and a myelination index, and the calculating of the probability may include: calculating the selected data and data of MRI of a previously stored disease-specific target case as a vector-based similarity percentile; and calculating an average value of the similarity percentiles.
According to an embodiment of the present invention, the calculating of the probability may include: evaluating a phenotype based similarity of the clinical information; and calculating a disease probability according to the similarity.
A data analysis system for disease diagnosis aid according to an embodiment of the present invention for solving the above problems may include: an input unit configured to receive medical data of a subject; a selection unit configured to select disease-related data using the medical data; and a disease detection unit configured to calculate a disease probability according to the selected disease-related data, wherein the medical data may include at least two or more of clinical records, genes and genetic variants, or MRI.
According to an embodiment of the present invention, the selection unit may select a genome variant having a possibility of disease association among all genes and gene variants of the subject.
According to an embodiment of the present invention, the calculating of the disease probability may include calculating a probability that the gene and gene variants selected by the processor are disease-related information; calculating an average rank of the selected genes according to the probability; and calculating a disease gene probability according to the number of disease candidate genes of the subject.
According to an embodiment of the present invention, the selection unit may select a volume value of the MRI, a white matter damage volume value, a cortical and subcortical region T2 high signal damage volume value, and a myelination index, and the disease detection unit may calculate the selected data and data of MRI of a previously stored disease-specific target case as a vector-based similarity percentile, and calculate an average value of the similarity percentiles.
According to an embodiment of the present invention, the disease detection unit may evaluate a phenotype based similarity of the clinical information, and calculate a disease probability according to the similarity.
According to the invention, it is possible to provide an integrated database that can utilize data from disease cohorts and published databases created through actual research and, based on this, obtain data that can be used when analyzing various types of patient data.
In addition, the present invention provides an analysis method including a method for quantitatively evaluating patient data of various types and capable of selectively combining and analyzing various types of patient data.
According to the above-described database and analysis method, a system usable in various clinical environments can be provided. In addition, the system provides a service that can shorten patient diagnosis time for clinicians based on various patient data.
Moreover, the effects of the present invention are not limited to the effects mentioned above, and various effects can be included within the scope of what will be apparent to a person skilled in the art from the following description.
Hereinafter, a ‘data analysis method and system for disease diagnosis aid’ according to the present invention will be described in detail with reference to the accompanying drawings. The described embodiments are provided so that those skilled in the art can easily understand the technical spirit of the present invention, and the present invention is not limited thereby. In addition, matters expressed in the accompanying drawings may be different from shapes actually implemented as schematic drawings to easily describe embodiments of the present invention.
Moreover, each component represented below is only an example for implementing the present invention. Accordingly, other components may be used in other implementations of the present invention without departing from the spirit and scope of the present invention.
In addition, each component may be implemented solely in the configuration of hardware or software, but may also be implemented in a combination of various hardware and software components that perform the same function. Also, two or more components may be implemented together by one hardware or software.
In addition, the expression ‘includes’ certain components is an open-ended expression that simply refers to the existence of those components, and should not be understood as excluding additional components.
Referring to
The data analysis system 100 for disease diagnosis aid according to an embodiment of the present invention can overcome the limitations of the databases used for analysis in several existing systems by using a separate database 30 built from the clinical, genomic, and MRI data of actual patients diagnosed with brain neurological developmental disorders and from public disease-related databases that provide information related to developmental disorders.
By inventing independent analysis methods of genomic, clinical, and MRI data and methods of integrating and analyzing results from the process, it is possible to solve the limited input data format problem of existing systems.
The corresponding system was developed to provide services to perform an aid role in the accurate diagnosis process of patients who are expected to suffer from diseases of the brain nervous system development, and provides the function to search the candidate disease list of the patient by analyzing genomic, clinical, and MRI data for the corresponding service.
The system described above may include a data analysis program for performing the corresponding function as shown in
The database 30 of the above-described system includes three types of data for storing Evidence information on the clinical features and causal genes of diseases associated with brain and nervous system development disorders, which is necessary for searching for a candidate disease of a patient. It can include Evidence created based on the Human Phenotype Ontology (HPO), The Development Disorder Genotype-Phenotype Database (DDG2P), and Online Mendelian Inheritance in Man (OMIM), that is, the public database 31, and Evidence created based on the clinical, genomic, and MRI data of actual patients diagnosed with brain and nervous system development disorders.
HPO, which is used for the Evidence information based on public databases, is a project that provides a vocabulary for standardizing clinical data arising from human disease. As part of that project, it provides standardized clinical data and a database containing information on the diseases related to the clinical data. The HPO data included in the above-described database includes clinical and genetic information associated with OMIM-based cerebral nervous system development disorder diseases, including information on genetic diseases and clinical data stored in basic standardized terms. In addition, in order to quantify differences in clinical data, several pieces of information needed for ontology-based similarity evaluation are added and stored together.
DDG2P is part of the Deciphering Developmental Disorders (DDD) project to analyze and study genomic and clinical data of children and parents with developmental disabilities in the UK and may provide standardized forms of clinical data in terms of disease-causing genes for developmental disorders and HPO terms observed in patients with actual diagnosis. The above-described database may include data such as clinical data, disease-causing genes, and mode of inheritance for brain neurological development diseases provided by DDG2P.
A wide range of information on the causative genes, genetic patterns, and patient symptoms reported so far for each hereditary disease based on previous reports of hereditary diseases is included in OMIM.
The above-described database may include clinical, genomic, and MRI data 32 of patients with actual brain neurological disease diagnosis. The actual patient's clinical data 32 may include the diagnosis name, disease cause gene, variant information, observed clinical abnormality of the patient in HPO terminology, and the like. The actual patient genotypic data contains information on the variant that causes the patient's disease. Because brain structure cannot be described accurately and in detail in HPO terms except in some very characteristic cases, the actual patient MRI data may store information on brain structure features derived through data processing and analysis.
The above-described database may include a portion for storing evidence data for each inputable data and patient analysis results to search for a candidate disease of a patient based on an analysis result considering one or more input data.
The data analysis program of the above-described system may include a function of analyzing and storing a patient's data inputted by a clinician in an analytically usable form and combining and analyzing the results of each analyzed data.
The data analysis program 41 may be stored in the user terminal 20 or may be stored in the server 40. When the data analysis program 41 is stored in the server 40, data processing may be requested through communication.
Existing systems that aid precise diagnosis of patients apply analysis methods that use genomic and clinical data. To use those analysis methods, however, each system either always requires certain data or requires it whenever specific other data is inputted. In contrast, the data analysis program described above includes an analysis method that can utilize MRI data in addition to the data used by existing systems, together with a function to combine and analyze the analysis results, and the functions for processing and analyzing each data format are modularized. This analysis method and structure is a distinct advantage over existing systems. Unlike existing systems, the data analysis program allows medical workers to directly select the data available for patient diagnosis and provides a data processing and analysis method according to the selected data, so that the system described above can provide a service that can be used in various clinical environments.
The data analysis program 41 according to an embodiment of the present invention may calculate disease similarity to actual patient data inputted to the system by using the genomic DB, clinical DB, and MRI DB stored in the public database 31 and the actual patient clinical database 32.
Referring to
The input unit 210 may receive medical data of an examinee. The medical data received by the input unit 210 may include clinical records, genes and genetic variants, and MRI. The data may be inputted in a computer-readable form. The input unit 210 may preprocess the medical data in a form that can be processed by the selection unit 220 or the disease detection unit 230 and transfer the preprocessed medical data.
The selection unit 220 may receive the medical data from the input unit 210. The selection unit 220 may select disease-related data using the medical data. Information included in the medical data may be selected.
The selection unit 220 may select a variant having a possibility of disease association among all gene variants possessed by the subject. The selection unit 220 may select the subject's brain region volume value, white matter damage volume value, cortex and subcortical region T2 high signal damage volume value, and myelination index from MRI data.
The disease detection unit 230 may calculate the disease probability according to the selected disease-related data. The disease detection unit 230 may provide an expected disease according to the probability of the disease. The disease detection unit 230 may calculate a disease probability according to a plurality of types of the disease-related data, and may determine a disease probability or a predicted disease in consideration of the calculated multiple disease probability.
The disease detection unit 230 synthesizes the results of pathogenicity prediction tools to determine the probability that the variant vj is a pathogenic variant with respect to the selected genome variants and calculates it as P(vj=pathogenic variant|prediction result of pathogenicity of vj).
If the gene gi has multiple variants vj, the disease detection unit 230 may obtain the disease gene probability of this gene gi as the maximum value of the pathogenic variant probabilities of its variants as follows. P(gi=disease gene)=max(P(vj=pathogenic variant|pathogenicity prediction result of vj))
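By way of a non-limiting illustration, the maximum described above can be sketched as follows. The function name and the (gene, probability) input layout are assumptions made for this example only and are not part of the described system.

```python
def disease_gene_probability(variant_probs):
    """Return P(gi = disease gene) for each gene.

    variant_probs: iterable of (gene, probability) pairs, one pair per
    variant vj, where probability is P(vj = pathogenic variant | ...).
    The pair layout is an assumption made for this sketch.
    """
    best = {}
    for gene, prob in variant_probs:
        # Keep the largest pathogenic variant probability seen for the gene.
        if gene not in best or prob > best[gene]:
            best[gene] = prob
    return best
```

A gene with a single variant simply takes that variant's probability.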
The disease detection unit 230 may obtain the average rank ri of the disease gene probability P(gi=disease gene) of each gene gi, for the disease candidate genes possessed by the subject.
If the subject has N disease candidate genes, the disease detection unit 230 may calculate the normalized disease gene probability PN(gi=disease gene) of the gene gi as shown in Equation 1 below.
PN(gi=disease gene)=1−(ri−1)/max(ri) [Equation 1]
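As a hedged sketch of Equation 1, the following example computes average ranks and the normalized value 1−(ri−1)/max(ri). It assumes that rank 1 is given to the highest probability and that tied probabilities share the mean of their positions; the description does not state either convention, and the function names are hypothetical.

```python
def average_ranks(scores):
    """Average ranks, rank 1 = highest score (assumed convention).

    Tied scores receive the mean of the positions they occupy.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = ((pos + 1) + (end + 1)) / 2.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def normalize_by_rank(scores):
    """Apply 1 - (ri - 1) / max(ri) to every score."""
    if not scores:
        return []
    ranks = average_ranks(scores)
    top = max(ranks)
    return [1.0 - (r - 1.0) / top for r in ranks]
```

Under these assumptions the top-ranked gene always receives a normalized value of 1.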
If the disease gene specified in the Evidence is gk, the disease gene of this Evidence is known with certainty, so the disease detection unit 230 may assume that its normalized disease gene probability is 1. The genetic variant based similarity between the patient and this Evidence can then be determined as min(PN(gk=disease gene), 1). However, this applies only when the patient's allelic status and genetic pattern of the gene gk variant are consistent with those specified in the Evidence; otherwise the similarity can be determined as 0. If the subject is compared with another patient B whose disease gene has not been identified, the genetic variant similarity in terms of gk between the two patients can be determined as min(PN(gk=disease gene), PNB(gk=disease gene)).
The disease detection unit 230 may require variants that may be associated with disease to satisfy all of the following criteria: 1) the variant is located in an exonic or splicing region; 2) it is not a synonymous variant; 3) its detection frequency is less than 0.5% in all known population cohorts; 4) its gene is listed as a disease-causing gene in OMIM; and 5) the allelic status of the variant is consistent with the genetic pattern of the corresponding disease.
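For illustration only, the two similarity rules above may be sketched as follows. The dictionary layout, the function names, and the choice to treat a gene absent from a patient's candidate list as having probability 0 are assumptions of this example.

```python
def similarity_to_evidence(pn_patient, gene_k, inheritance_consistent):
    """Similarity between a patient and an Evidence whose disease gene is gk.

    pn_patient: dict mapping gene -> normalized disease gene probability PN.
    inheritance_consistent: True when the patient's allelic status and
    genetic pattern match those specified in the Evidence.
    """
    if not inheritance_consistent:
        return 0.0
    # The Evidence side is taken as 1, so the result is min(PN, 1).
    return min(pn_patient.get(gene_k, 0.0), 1.0)


def similarity_between_patients(pn_a, pn_b, gene_k):
    """Similarity in terms of gk between two patients: min(PN, PNB)."""
    return min(pn_a.get(gene_k, 0.0), pn_b.get(gene_k, 0.0))
```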
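The criteria above can be expressed as a single predicate, as in the following minimal sketch. The `Variant` record and its field names are hypothetical and chosen only for this example; in particular, the frequency criterion is modeled by assuming that the highest frequency observed across population cohorts has already been computed.

```python
from dataclasses import dataclass

MAX_POPULATION_FREQUENCY = 0.005  # 0.5%, expressed as a fraction


@dataclass
class Variant:
    # All field names are hypothetical, for illustration only.
    region: str                       # e.g. "exonic", "splicing", "intronic"
    is_synonymous: bool
    max_population_frequency: float   # highest frequency across cohorts
    gene_in_omim: bool                # gene listed as disease-causing in OMIM
    allelic_status_consistent: bool   # matches the disease's genetic pattern


def may_be_disease_associated(v):
    """Return True only when the variant satisfies every criterion."""
    return (
        v.region in ("exonic", "splicing")
        and not v.is_synonymous
        and v.max_population_frequency < MAX_POPULATION_FREQUENCY
        and v.gene_in_omim
        and v.allelic_status_consistent
    )
```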
In order to calculate the pathogenic probability of each variant, the disease detection unit 230 may utilize the pathogenicity information of the previously reported disease gene variant DB, ClinVar, and prediction information of the following pathogenicity prediction tools: SIFT, Polyphen2, LRT, MutationTaster, MutationAssessor, FATHMM, RadialSVM, LR.
In the disease detection unit 230, the probability P(vj=pathogenic variant|pathogenicity prediction result of vj) that a variant vj is a pathogenic variant can be obtained by averaging the probabilities Pt(vj=pathogenic variant|pathogenicity prediction result of vj by t) obtained for each prediction tool t. Here, Pt(vj=pathogenic variant|pathogenicity prediction result of vj by t) can be calculated by Bayes' theorem as follows. Pt(vj=pathogenic variant|pathogenicity prediction result of vj by t)=Pt(pathogenicity prediction result of vj by t|vj=pathogenic variant)×P(vj=pathogenic variant)/P(pathogenicity prediction result of vj by t)
Pt(pathogenicity prediction result of vj by t|vj=pathogenic variant) for use in the above calculation can be calculated by taking two different versions of ClinVar and treating the older version as the prediction and the latest version as the actual variant information. This calculation can be done by training a naive Bayes classifier in which each gene variant described in ClinVar is one input data item and the prediction information of the pathogenicity prediction tools for that gene variant forms the parameters of the input data item.
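A minimal sketch of the Bayes step and the averaging over tools is given below. It assumes that the per-tool likelihood and the probability of the observed prediction result have already been estimated; the function names and the (likelihood, evidence) pair layout are hypothetical.

```python
def tool_posterior(likelihood, prior, evidence):
    """Bayes' theorem for one prediction tool t.

    likelihood: Pt(prediction result of vj by t | vj = pathogenic variant)
    prior:      P(vj = pathogenic variant)
    evidence:   P(prediction result of vj by t)
    """
    if evidence <= 0:
        raise ValueError("evidence probability must be positive")
    return likelihood * prior / evidence


def pathogenic_probability(tool_inputs, prior):
    """Average the per-tool posteriors.

    tool_inputs: list of (likelihood, evidence) pairs, one per tool.
    """
    if not tool_inputs:
        raise ValueError("at least one prediction tool is required")
    posteriors = [tool_posterior(lik, prior, ev) for lik, ev in tool_inputs]
    return sum(posteriors) / len(posteriors)
```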
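As one possible illustration of how a likelihood term of this kind could be estimated, the sketch below counts how often a tool gave each prediction among variants labeled pathogenic. This is a plain frequency estimate without smoothing, shown under the assumption that labeled (prediction, is_pathogenic) records are available; it is not presented as the classifier training actually used, and the names are hypothetical.

```python
from collections import Counter


def estimate_likelihoods(records):
    """Estimate P(prediction | pathogenic) for one tool by counting.

    records: iterable of (prediction, is_pathogenic) pairs, where
    prediction is the tool's output label for a variant and
    is_pathogenic is the reference label for that variant.
    """
    counts = Counter(pred for pred, is_pathogenic in records if is_pathogenic)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pred: count / total for pred, count in counts.items()}
```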
P(vj=pathogenic variant) and P(pathogenicity prediction result of vj by t) can be estimated from 69,499,850 gene variants present in a total of 127 patient whole exome-sequencing data.
The disease detection unit 230 may calculate the similarity through the phenotype-based similarity evaluation of the clinical information. The disease detection unit 230 may calculate a disease probability using the similarity. The disease detection unit 230 may present the predicted disease using the similarity or disease probability.
The disease detection unit 230 may calculate the similarity through a total of 35 phenotype term list-to-term list similarity calculation techniques, formed by combining seven phenotype term-to-term similarity evaluation techniques provided by software libraries, namely Resnik, Lin, Jiang-Conrath, relevance, information coefficient, graph IC, and Wang, with five similarity combining techniques that can be used for term set-to-term set similarity calculation, namely Max, Mean, funSimMax, funSimAvg, and BMA.
In order to find the best of the 35 similarity evaluation techniques, the disease detection unit 230 may use the disease information and phenotypes of 151 patients and, through a leave-one-out cross-validation method, calculate the phenotype similarity between each case and all other cases and evaluate how highly cases with the same disease are ranked.
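The count of 35 follows from pairing each of the seven term-to-term measures with each of the five combining techniques. The sketch below enumerates the pairs and, as one example of a combining technique, implements a best-match average over a term-to-term similarity matrix. The matrix layout and the specific BMA definition (mean of all row maxima and column maxima) are assumptions based on common usage, not taken from the description.

```python
from itertools import product

TERM_MEASURES = ["Resnik", "Lin", "Jiang-Conrath", "relevance",
                 "information coefficient", "graph IC", "Wang"]
COMBINERS = ["Max", "Mean", "funSimMax", "funSimAvg", "BMA"]

# Every (term measure, combiner) pair: 7 x 5 = 35 techniques.
ALL_TECHNIQUES = list(product(TERM_MEASURES, COMBINERS))


def best_match_average(sim_matrix):
    """Best-match average of a non-empty term-to-term similarity matrix.

    sim_matrix[i][j] is the similarity between term i of the first list
    and term j of the second list. The result averages the best match of
    every term in both directions (assumed definition).
    """
    row_best = [max(row) for row in sim_matrix]
    col_best = [max(col) for col in zip(*sim_matrix)]
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))
```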
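A leave-one-out evaluation of this kind might be sketched as follows. For each case, the remaining cases are ordered by similarity and the position of the first case with the same disease is recorded. The case layout and the function name are hypothetical, and any similarity function taking two term collections may be supplied.

```python
def leave_one_out_ranks(cases, similarity):
    """Rank of the best same-disease match for every held-out case.

    cases: list of (disease, terms) pairs.
    similarity: function(terms_a, terms_b) -> float, higher = more similar.
    Returns one rank per case (1 = best), or None when no other case
    shares the held-out case's disease.
    """
    results = []
    for i, (disease, terms) in enumerate(cases):
        # Score every other case against the held-out case.
        others = [(similarity(terms, other_terms), other_disease)
                  for j, (other_disease, other_terms) in enumerate(cases)
                  if j != i]
        others.sort(key=lambda item: -item[0])
        rank = next((pos + 1 for pos, (_, d) in enumerate(others) if d == disease),
                    None)
        results.append(rank)
    return results
```

Repeating this for each of the 35 techniques and comparing the resulting ranks is one way the techniques could be compared.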
The disease detection unit 230 may calculate a percentile of the vector-based similarity of each of the disease-related data classifications selected from MRI data of the subject and MRI data of comparison cases, and may obtain the average value of the percentile similarity calculated for each classification.
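The percentile step can be illustrated with the sketch below. The description does not name the vector similarity measure or the percentile convention, so cosine similarity and "percentage of comparison cases with similarity less than or equal to this one" are both assumptions of this example, as are the function names and the dictionary layout.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def percentile_ranks(values):
    """Percentage of values less than or equal to each value."""
    n = len(values)
    return [100.0 * sum(1 for w in values if w <= v) / n for v in values]


def mean_similarity_percentiles(subject, cases):
    """Average, over feature classifications, of each case's percentile.

    subject: dict mapping classification name -> feature vector.
    cases: list of dicts with the same keys, one per comparison case.
    """
    categories = list(subject)
    totals = [0.0] * len(cases)
    for name in categories:
        sims = [cosine(subject[name], case[name]) for case in cases]
        for i, pct in enumerate(percentile_ranks(sims)):
            totals[i] += pct / len(categories)
    return totals
```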
The disease detection unit 230 may obtain an average rank ri between the input case and the comparison target data based on the calculated average value of the similarity percentile, and based on this, may finally calculate the normalized similarity value 1−(ri−1)/max(ri).
The disease detection unit 230 may calculate normalized similarity values of input patient data and reference data (e.g., SNU cohort or DDD project data) in the platform for each data type through the above processes.
When all or a part of the similarity for each data type is selected and combined, the disease detection unit 230 may calculate a general similarity as an average of corresponding normalized similarity values.
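As a simple illustration of the combination step, the sketch below averages the normalized similarity values of whichever data types are selected. The data type keys and the function name are hypothetical.

```python
def general_similarity(similarities, selected=None):
    """Average the normalized similarities of the selected data types.

    similarities: dict mapping data type (e.g. "genomic", "clinical",
    "mri") -> normalized similarity value.
    selected: data types to combine; all available types when None.
    """
    keys = list(similarities) if selected is None else list(selected)
    if not keys:
        raise ValueError("at least one data type must be selected")
    return sum(similarities[k] for k in keys) / len(keys)
```

For example, selecting only the genomic and clinical values leaves the MRI value out of the average entirely rather than counting it as zero.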
Referring to
The above-described program may perform an annotation operation (S131) for adding information on variants using an input VCF file. In this operation, the ANNOVAR program is used to generate a tab-separated text result file including, for each variant, information on its gene, its population-level frequency, its region, and its pathogenic scores. Thereafter, additional annotation and filtering operations may be performed on the result file generated by the annotation process (S133). The Filtering & Tiering process may use OMIM, a database of disease genes not supported by the ANNOVAR program, various logical expressions developed in-house to process variant genotypes, and Germline Variant Annotation Filtering (GVAF) software, which provides annotation based on text files rather than the VCF format and filters genetic variants through combinations of logical expressions; the same software may additionally annotate disease information based on the genetic information of each variant. In order to find disease-causing variants, it is possible to filter and extract variants that are present in the exon or splicing region and that are observed with frequencies of less than 0.05% in the databases at various population levels.
Variants extracted by the filtering process can be classified according to the classification conditions of whether the variant is a direct disease cause or whether it is a variant of an existing disease-causing gene (S135).
The expected pathogenic variant process (S135) calculates the pathogenic score of each variant selected by the Filtering & Tiering process and then finds variants that can cause disease. Based on various variant information, including the information generated by the expected pathogenic variant process, the similarity with the Evidence stored in the database described above is calculated, so that the genotypic data of the input patient can be quantitatively evaluated against the Evidence (S137).
The process of quantitatively evaluating the similarity between the input patient data and the Evidence (S137) may calculate the similarity by comparing the Evidence information stored in the database with the genomic variant that causes the predicted disease.
Referring to
The above-described program may analyze phenotypic data using an ontology-based similarity evaluation method, which obtains a term-to-term similarity from information on the relationships between terms. For this, a preprocessing process for analyzing the input phenotypic data is performed (S141). The preprocessing process (S141) changes the data type so that actual phenotypic data can be evaluated quantitatively; specifically, it converts data inputted in the form of HPO Term names into the form of HPO Term IDs. For example, when the inputted phenotypic data is “Focal seizures, Global developmental delay, Intellectual disability”, the preprocessing process converts it to the corresponding HPO Term IDs “0007359, 0001263, 0001249”.
The phenotypic data converted to HPO Term IDs is passed to a self-developed program that calculates its similarity to the phenotypic data of the Evidence stored in the above-described database, thereby performing quantitative evaluation of phenotypic data between the input patient and the Evidence (S143).
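The conversion in the preprocessing process (S141) may be illustrated with the following minimal sketch. The lookup table contains only the three terms from the example above; an actual implementation would need a complete name-to-ID table, and the function and table names are hypothetical.

```python
# Illustrative lookup limited to the three terms in the example above.
HPO_NAME_TO_ID = {
    "Focal seizures": "0007359",
    "Global developmental delay": "0001263",
    "Intellectual disability": "0001249",
}


def to_hpo_ids(phenotype_text):
    """Convert a comma-separated list of HPO Term names to HPO Term IDs.

    Raises KeyError for a name that is not in the lookup table.
    """
    names = [name.strip() for name in phenotype_text.split(",") if name.strip()]
    return [HPO_NAME_TO_ID[name] for name in names]
```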
The similarity evaluation process (S143) between the input patient data and the Evidence data may calculate the similarity between the preprocessed clinical data of the patient and the Evidence data stored in the database.
Referring to
The image data obtained by the preprocessing process (S151) is used to derive direct attribute values related to brain neurological diseases and brain functional damage (S153).
By using software to extract features of the brain structure, data such as the volume of normal gray matter and white matter, the volume of the damaged white matter lesion, cortical thickness, cortical area, and curvature are derived (S153).
The derived attribute values are used to calculate the similarity of comparing with MRI data of actual patients stored in the above-described database. By calculating the similarity between the attribute value for the brain structure characteristic of the input patient and the Evidence data, quantitative evaluation of MRI data between the input patient and Evidence is performed (S155).
The analysis method includes a method of combining results evaluated by a data analysis process, and various patient data can be selectively used by utilizing the corresponding analysis method.
Referring to
Referring to
Referring to
In operation S910, the input unit 210 may receive medical data of the examinee. The medical data received by the input unit 210 may include clinical records, genes and genetic variants, and MRI. The data may be inputted in a computer-readable form. The input unit 210 may preprocess the medical data in a form that can be processed by the selection unit 220 or the disease detection unit 230 and transfer the preprocessed medical data.
A data analysis method for disease diagnosis aid according to an embodiment of the present invention may include selecting the disease-related data using the medical data (S920).
In operation S920, the selection unit 220 may receive the medical data from the input unit 210. The selection unit 220 may select disease-related data using the medical data. Information included in the medical data may be selected.
The selection unit 220 may select a variant having a possibility of disease association among all gene variants possessed by the subject. The selection unit 220 may select the subject's brain region volume value, white matter damage volume value, cortex and subcortical region T2 high signal damage volume value, and myelination index from MRI data.
A data analysis method for disease diagnosis aid according to an embodiment of the present invention may include calculating the disease probability according to the selected disease-related data (S930).
In operation S930, the disease detection unit 230 may calculate the disease probability according to the selected disease-related data. The disease detection unit 230 may provide an expected disease according to the probability of the disease. The disease detection unit 230 may calculate a disease probability according to a plurality of types of the disease-related data, and may determine a disease probability or a predicted disease in consideration of the calculated multiple disease probability.
The disease detection unit 230 synthesizes the results of pathogenicity prediction tools to determine the probability that the variant vj is a pathogenic variant with respect to the selected genome variants and calculates it as P(vj=pathogenic variant|prediction result of pathogenicity of vj).
If the gene gi has multiple variants vj, the disease detection unit 230 may obtain the disease gene probability of this gene gi as the maximum value of the pathogenic variant probabilities of its variants as follows. P(gi=disease gene)=max(P(vj=pathogenic variant|pathogenicity prediction result of vj))
The disease detection unit 230 may obtain the average rank ri of the disease gene probability P(gi=disease gene) of each gene gi, for the disease candidate genes possessed by the subject.
If the subject has N disease candidate genes, the disease detection unit 230 may calculate the normalized disease gene probability PN(gi=disease gene) of the gene gi as shown in Equation 1 below.
PN(gi=disease gene)=1−(ri−1)/max(ri) [Equation 1]
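A minimal sketch of the rank normalization 1−(ri−1)/max(ri) is given below. It assumes that rank 1 is assigned to the highest probability and that "average rank" means tied genes share the mean of the ranks they occupy; both points are assumptions of this sketch, and the function names are hypothetical.

```python
def average_ranks(scores):
    """Rank keys by descending score (1 = highest); ties share the mean rank."""
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of keys tied with order[i].
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def normalized_rank_score(scores):
    """Apply 1 - (r_i - 1) / max(r_i) to the average ranks (scores must be non-empty)."""
    ranks = average_ranks(scores)
    max_rank = max(ranks.values())
    return {key: 1 - (r - 1) / max_rank for key, r in ranks.items()}


# Illustrative disease gene probabilities for N = 4 candidate genes.
example_scores = {"g1": 0.9, "g2": 0.5, "g3": 0.5, "g4": 0.1}
normalized = normalized_rank_score(example_scores)
```

With these values the average ranks are 1, 2.5, 2.5, and 4, so the top-ranked gene receives a normalized value of 1 and the lowest-ranked gene receives 0.25.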
If the disease gene specified in an Evidence is gk, the disease detection unit 230 may assume that the normalized disease gene probability of the Evidence is 1, since the disease gene of the Evidence is known to be gk. In this case, the genetic variant-based similarity between the patient and the Evidence can be determined as min(PN(gk=disease gene), 1). However, this applies only when the patient's allelic status and inheritance pattern for the gk variant are consistent with those specified in the Evidence; otherwise, the similarity can be determined as 0. If the subject is instead compared with another patient B whose disease gene has not been identified, the genetic variant similarity with respect to gk between the two patients can be determined as min(PN(gk=disease gene), PNB(gk=disease gene)).
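The two similarity rules for gk may be illustrated by the following sketch. The function and parameter names are hypothetical, and the consistency check on allelic status and inheritance pattern is assumed to have been performed elsewhere and passed in as a boolean.

```python
def evidence_similarity(pn_patient_gk, consistent_with_evidence):
    """Similarity between a patient and an Evidence whose disease gene is gk.

    The Evidence side is taken as probability 1, so the result is
    min(PN(gk), 1), or 0 when the patient's allelic status and inheritance
    pattern do not match the Evidence.
    """
    if not consistent_with_evidence:
        return 0.0
    return min(pn_patient_gk, 1.0)


def patient_pair_similarity(pn_a_gk, pn_b_gk):
    """Similarity with respect to gk between two patients without a confirmed disease gene."""
    return min(pn_a_gk, pn_b_gk)
```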
The disease detection unit 230 may treat a variant as potentially disease-associated only if the variant satisfies all of the following criteria: 1) it is located in an exonic or splicing region; 2) it is not a synonymous variant; 3) its detection frequency is less than 0.5% in all known population cohorts; 4) its gene is listed as a disease-causing gene in OMIM; and 5) its allelic status is consistent with the inheritance pattern of the corresponding disease.
In order to calculate the pathogenic probability of each variant, the disease detection unit 230 may utilize ClinVar pathogenicity information and prediction information from the following pathogenicity prediction tools: SIFT, Polyphen2, LRT, MutationTaster, MutationAssessor, FATHMM, RadialSVM, and LR.
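By way of example only, the selection criteria above may be expressed as a single predicate. The `Variant` record and its field names are hypothetical, and the OMIM lookup and the inheritance-consistency check are assumed to have been resolved upstream into booleans.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    region: str                   # e.g. "exonic", "splicing", "intronic"
    consequence: str              # e.g. "missense", "synonymous"
    max_population_af: float      # highest frequency across population cohorts (fraction)
    gene_in_omim: bool            # gene listed as disease-causing in OMIM
    inheritance_consistent: bool  # allelic status matches the disease's inheritance pattern


MAX_FREQUENCY = 0.005  # 0.5%, as stated in the criteria


def is_disease_candidate(v):
    """Return True only if the variant satisfies every selection criterion."""
    return (
        v.region in {"exonic", "splicing"}
        and v.consequence != "synonymous"
        and v.max_population_af < MAX_FREQUENCY
        and v.gene_in_omim
        and v.inheritance_consistent
    )


# Illustrative records only.
rare_missense = Variant("exonic", "missense", 0.001, True, True)
common_missense = Variant("exonic", "missense", 0.02, True, True)
rare_synonymous = Variant("exonic", "synonymous", 0.001, True, True)
```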
In the disease detection unit 230, the probability P(vj=pathogenic variant|pathogenicity prediction result of vj) that a variant vj is a pathogenic variant can be obtained by averaging Pt(vj=pathogenic variant|pathogenicity prediction result of vj by t) obtained for each prediction tool t. Here, Pt(vj=pathogenic variant|pathogenicity prediction result of vj by t) can be calculated by Bayes' theorem as follows. Pt(vj=pathogenic variant|pathogenicity prediction result of vj by t)=Pt(pathogenicity prediction result of vj by t|vj=pathogenic variant)×P(vj=pathogenic variant)/P(pathogenicity prediction result of vj by t)
Pt(pathogenicity prediction result of vj by t|vj=pathogenic variant) used in the above calculation can be estimated from two different versions of ClinVar by treating the older version as the prediction and the latest version as the actual variant information.
P(vj=pathogenic variant) and P(pathogenicity prediction result of vj by t) can be estimated from the 69,499,850 gene variants present in the whole-exome sequencing data of a total of 127 patients.
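The per-tool Bayes calculation and the averaging across tools may be sketched as follows. The function names and the numeric inputs are illustrative assumptions; in practice the likelihood, prior, and evidence terms would be estimated from data rather than supplied by hand.

```python
def posterior_pathogenic(likelihood, prior, evidence):
    """Bayes' theorem for one prediction tool t.

    likelihood: P(prediction result by t | variant is pathogenic)
    prior:      P(variant is pathogenic)
    evidence:   P(prediction result by t)
    """
    if evidence <= 0:
        raise ValueError("evidence probability must be positive")
    return likelihood * prior / evidence


def combined_pathogenic_probability(per_tool_terms):
    """Average the per-tool posteriors; per_tool_terms is a list of
    (likelihood, prior, evidence) tuples, one per prediction tool."""
    posteriors = [posterior_pathogenic(*terms) for terms in per_tool_terms]
    return sum(posteriors) / len(posteriors)


# Illustrative values for two hypothetical tools.
example_terms = [(0.8, 0.1, 0.2), (0.5, 0.1, 0.25)]
combined = combined_pathogenic_probability(example_terms)
```

In this example the two per-tool posteriors are approximately 0.4 and 0.2, so the combined probability is approximately 0.3.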
The disease detection unit 230 may calculate the similarity through the phenotype-based similarity evaluation of the clinical information. The disease detection unit 230 may calculate a disease probability using the similarity. The disease detection unit 230 may present the predicted disease using the similarity or disease probability.
The disease detection unit 230 may calculate the similarity using a total of 35 phenotype term list-to-term list similarity calculation techniques, obtained by combining seven phenotype term-to-term similarity evaluation techniques available in software libraries (Resnik, Lin, Jiang-Conrath, relevance, information coefficient, graph IC, and Wang) with five similarity combining techniques that can be used for term set-to-term set similarity calculation (Max, Mean, funSimMax, funSimAvg, and BMA).
In order to find the best of the 35 similarity evaluation techniques, the disease detection unit 230 may use the disease information and phenotypes of 151 patients and, through a leave-one-out cross-validation method, calculate the phenotype similarity between each case and all other cases and evaluate how highly the cases with the same disease are ranked.
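As an illustration of how seven term-to-term measures and five combining techniques yield 35 combinations, the sketch below enumerates the pairs and applies the five combiners to a small term-to-term similarity matrix. The combiner definitions follow commonly used formulations (row and column best matches) and are an assumption of this sketch rather than definitions given in the embodiment; the matrix values are invented, and the term-to-term measures themselves are not implemented here.

```python
from itertools import product
from statistics import mean

TERM_MEASURES = [
    "Resnik", "Lin", "Jiang-Conrath", "relevance",
    "information coefficient", "graph IC", "Wang",
]


def row_maxima(matrix):
    """Best match in the other term list for each row term."""
    return [max(row) for row in matrix]


def col_maxima(matrix):
    """Best match in the other term list for each column term."""
    return [max(col) for col in zip(*matrix)]


# Assumed formulations; rows = patient terms, columns = comparison terms.
COMBINERS = {
    "Max": lambda m: max(row_maxima(m)),
    "Mean": lambda m: mean(v for row in m for v in row),
    "funSimMax": lambda m: max(mean(row_maxima(m)), mean(col_maxima(m))),
    "funSimAvg": lambda m: 0.5 * (mean(row_maxima(m)) + mean(col_maxima(m))),
    "BMA": lambda m: (sum(row_maxima(m)) + sum(col_maxima(m))) / (len(m) + len(m[0])),
}

# Every (term measure, combiner) pair is one list-to-list technique.
techniques = list(product(TERM_MEASURES, COMBINERS))

# Invented 3 x 2 term-to-term similarity matrix for demonstration.
example_matrix = [[1.0, 0.5], [0.25, 0.75], [0.0, 0.5]]
combined_scores = {name: fn(example_matrix) for name, fn in COMBINERS.items()}
```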
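The leave-one-out evaluation may be sketched as below. Jaccard overlap of phenotype sets is used purely as a stand-in similarity so that the example runs on its own; it is not one of the 35 techniques, and the function names and case data are hypothetical.

```python
def jaccard(a, b):
    """Stand-in similarity: overlap of two phenotype sets (an assumption of this sketch)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def loo_same_disease_ranks(cases, similarity=jaccard):
    """For each held-out case, rank all other cases by similarity and return
    the rank (1 = most similar) of the best-ranked case with the same disease,
    or None if no other case shares the disease.

    cases: list of (disease, phenotype_set) tuples.
    """
    ranks = []
    for i, (disease, phenotypes) in enumerate(cases):
        others = [
            (similarity(phenotypes, other_phenotypes), other_disease)
            for j, (other_disease, other_phenotypes) in enumerate(cases)
            if j != i
        ]
        others.sort(key=lambda pair: pair[0], reverse=True)
        rank = next(
            (r for r, (_sim, d) in enumerate(others, start=1) if d == disease),
            None,
        )
        ranks.append(rank)
    return ranks


# Invented cases for demonstration.
example_cases = [
    ("D1", {"a", "b", "c"}),
    ("D1", {"a", "b"}),
    ("D2", {"x", "y"}),
    ("D2", {"a", "b", "c", "x"}),
]
loo_ranks = loo_same_disease_ranks(example_cases)
```

A technique that places same-disease cases nearer rank 1 across the held-out cases would be preferred under this kind of evaluation.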
The disease detection unit 230 may calculate a percentile of the vector-based similarity of each of the disease-related data classifications selected from MRI data of the subject and MRI data of comparison cases, and may obtain the average value of the percentile similarity calculated for each classification.
The disease detection unit 230 may obtain an average rank ri between the input case and the comparison target data based on the calculated average value of the similarity percentile, and based on this, may finally calculate the normalized similarity value 1−(ri−1)/max(ri).
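The percentile step for MRI data may be illustrated as follows. The vector similarity 1/(1+Euclidean distance) and the percentile definition (fraction of comparison cases with similarity less than or equal to the given one) are assumptions made for this sketch, as are the classification names and values.

```python
import math


def vector_similarity(u, v):
    """Assumed vector-based similarity: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(u, v))


def percentiles(values):
    """Fraction of values less than or equal to each value."""
    n = len(values)
    return [sum(1 for w in values if w <= v) / n for v in values]


def mean_percentile_similarity(subject, cases):
    """Average, over the MRI classifications, of each comparison case's
    similarity percentile with respect to the subject.

    subject: {classification: feature vector}
    cases:   list of dicts with the same keys as subject
    """
    per_class = []
    for name, vector in subject.items():
        sims = [vector_similarity(vector, case[name]) for case in cases]
        per_class.append(percentiles(sims))
    # zip(*per_class) regroups the percentiles by comparison case.
    return [sum(col) / len(col) for col in zip(*per_class)]


# Invented feature vectors for demonstration.
subject_mri = {"brain_volume": [1.0, 2.0], "white_matter": [0.5]}
comparison_cases = [
    {"brain_volume": [1.0, 2.0], "white_matter": [0.5]},
    {"brain_volume": [4.0, 6.0], "white_matter": [1.5]},
    {"brain_volume": [1.0, 3.0], "white_matter": [2.5]},
]
mean_percentiles = mean_percentile_similarity(subject_mri, comparison_cases)
```

The resulting per-case averages can then be ranked and normalized with 1−(ri−1)/max(ri) in the same way as the disease gene probabilities.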
The disease detection unit 230 may calculate normalized similarity values of input patient data and reference data (e.g., SNU cohort or DDD project data) in the platform for each data type through the above processes.
When all or a part of the similarity for each data type is selected and combined, the disease detection unit 230 may calculate a general similarity as an average of corresponding normalized similarity values.
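A minimal sketch of the general similarity as an unweighted average over the selected data types is given below. The dictionary keys and the function name are hypothetical.

```python
def general_similarity(normalized, selected=None):
    """Average the normalized similarity values of the selected data types.

    normalized: {data_type: normalized similarity}
    selected:   iterable of data types to combine; all types when None
    """
    keys = list(normalized) if selected is None else list(selected)
    if not keys:
        raise ValueError("at least one data type must be selected")
    return sum(normalized[key] for key in keys) / len(keys)


# Illustrative normalized similarities for one comparison case.
example_similarities = {"genotype": 0.75, "phenotype": 0.5, "mri": 0.25}
all_types = general_similarity(example_similarities)
genotype_and_phenotype = general_similarity(
    example_similarities, ["genotype", "phenotype"]
)
```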
A data analysis method and system for disease diagnosis aid according to another embodiment of the present invention applies a weight to each evaluation value for clinical record data, genotypic data, and MRI data to diagnose the corresponding patient with a disease with the highest probability. Meanwhile, the following equation can be used as a method of applying the weight.
Here, ecdf(x; z) is defined as an empirical cumulative distribution function for z, P means an input patient, D means a type of disease, Pr( ) means probability, w0 is a weight for genotypic data, w1 is a weight for phenotype information, w2 is a weight for MRI data, and w0+w1+w2=1 is satisfied, and each variable is defined as follows.
T: The number of prediction tools PathoPred that predict the pathogenicity of genetic variants,
θt: Weight for the t-th prediction tool PathoPredt. Σθt=1
vij: j-th variant of patient P for the i-th gene reported to induce disease D
m: The number of phenotypes observed in patient P
n: The number of phenotypes reported in disease D
phenotypePi: i-th phenotype of patient P
phenotypeDj: j-th phenotype reported in disease D
freqD(phenotype): Frequency of phenotypes reported in disease D
MRIfP: f-th feature of patient P MRI data vector
MRIfD: f-th feature of disease D MRI data vector
γf: Weight for f-th feature of MRI data vector. Σγf=1
P(vij is pathogenic|PathoPredt) represents the probability, according to each pathogenicity prediction tool, that the gene variant causes the disease. This probability value may be estimated from a database of previously reported pathogenic gene variants and from gene variant information of normal (unaffected) individuals.
The values of the weights w0, w1, and w2 can be set as follows, and the weights can be adjusted according to the purpose.
Top1 means the accuracy with which the disease ranked first among the predicted diseases is the actual disease, and Top5 means the accuracy with which the actual disease is among the five highest-ranked predicted diseases.
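The Top1 and Top5 measures may be computed as in the following sketch. The function name and the example predictions are hypothetical and do not reproduce any reported result.

```python
def top_k_accuracy(ranked_predictions, actual_diseases, k):
    """Fraction of patients whose actual disease appears among the first k
    predicted diseases.

    ranked_predictions: one list per patient, ordered from most to least probable
    actual_diseases:    the confirmed disease for each patient, in the same order
    """
    hits = sum(
        1
        for ranked, actual in zip(ranked_predictions, actual_diseases)
        if actual in ranked[:k]
    )
    return hits / len(actual_diseases)


# Invented predictions for three patients.
example_predictions = [
    ["D1", "D2", "D3"],
    ["D2", "D1", "D3"],
    ["D3", "D2", "D1"],
]
example_actual = ["D1", "D1", "D1"]
top1 = top_k_accuracy(example_predictions, example_actual, 1)
top5 = top_k_accuracy(example_predictions, example_actual, 5)
```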
So far, the present invention has been described with a focus on the preferred embodiments. Those skilled in the art to which the present invention pertains will appreciate that the present invention may be implemented in a modified form without departing from the essential characteristics of the present invention. Therefore, the disclosed embodiments should be considered in a descriptive sense only and not for purposes of limitation. The scope of the invention is defined not by the detailed description of the invention but by the appended claims, and all differences within the scope will be construed as being included in the present invention.
Number | Date | Country | Kind |
---|---|---|---|
10-2018-0150599 | Nov 2018 | KR | national |
This is a continuation-in-part of PCT/KR2018/016983, filed Dec. 31, 2018, which claims priority under 35 U.S.C. § 119 to Korean Patent Application No. 10-2018-0150599, filed on Nov. 29, 2018, both of which are hereby incorporated by reference in their entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/KR2018/016983 | Dec 2018 | US |
Child | 16879584 | | US |